<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[PhonoByte: Short Summary]]></title><description><![CDATA[Short Summary on latest Speech and Audio AI]]></description><link>https://sankar1535.substack.com/s/short-summary</link><image><url>https://substackcdn.com/image/fetch/$s_!4m5b!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73c1ee4-049d-4623-9003-7d1cf146709e_408x408.png</url><title>PhonoByte: Short Summary</title><link>https://sankar1535.substack.com/s/short-summary</link></image><generator>Substack</generator><lastBuildDate>Mon, 04 May 2026 10:10:56 GMT</lastBuildDate><atom:link href="https://sankar1535.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sankar Mukherjee]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[sankar1535@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[sankar1535@substack.com]]></itunes:email><itunes:name><![CDATA[Sankar Mukherjee]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sankar Mukherjee]]></itunes:author><googleplay:owner><![CDATA[sankar1535@substack.com]]></googleplay:owner><googleplay:email><![CDATA[sankar1535@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sankar Mukherjee]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[BitNet Distillation: The 3-Stage Pipeline That Delivers Full-Precision LLM Performance at 1.58-bit Ultra-Low Quantization]]></title><description><![CDATA[BitNet Distillation Paper: [Link] Github: [Link]]]></description><link>https://sankar1535.substack.com/p/bitnet-distillation-the-3-stage-pipeline</link><guid 
isPermaLink="false">https://sankar1535.substack.com/p/bitnet-distillation-the-3-stage-pipeline</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Fri, 17 Oct 2025 13:05:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-zqU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>BitNet Distillation Paper: [<a href="https://arxiv.org/pdf/2510.13998">Link</a>]                          Github: [<a href="https://github.com/microsoft/BitNet">Link</a>]</p><div><hr></div><h3>The Problem</h3><p>The core challenge addressed is the <strong>difficulty of deploying Large Language Models (LLMs) in downstream applications</strong>, especially on resource-constrained devices like smartphones, due to their rapidly escalating size, which leads to prohibitive memory consumption and computational overhead.</p><p>Specifically related to ultra-low precision quantization:</p><ul><li><p>Existing extreme low-bit LLMs, such as the 1.58-bit (ternary values ${-1, 0, 1}$) BitNet, require pretraining from scratch on large-scale corpora to achieve competitive accuracy, which incurs substantial computational and energy overhead.</p></li><li><p>Directly fine-tuning existing full-precision LLMs into 1.58-bit precision (referred to as BitNet-SFT) for specific downstream tasks is often unstable, results in significant performance degradation, and exhibits <strong>poor scalability</strong>. 
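</p><p>For reference, the 1.58-bit ternary quantizer that both approaches target can be sketched as follows. This is the standard BitNet b1.58 absmean formulation, shown as a minimal illustration rather than code from the paper:</p>

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Absmean ternary quantization in the style of BitNet b1.58:
    scale by the mean absolute weight, then round each entry
    into the ternary set {-1, 0, 1}."""
    gamma = np.abs(w).mean()                      # per-tensor absmean scale
    w_q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_q, gamma                             # dequantize as w_q * gamma
```

<p>During quantization-aware training, the forward pass uses the quantized weights while gradients update the latent full-precision weights, typically via a straight-through estimator.</p><p>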
The performance gap relative to the full-precision baseline tends to widen as the model size increases (e.g., from 0.6B to 4B, the gap relative to the FP16 baseline grows from 13.9 to 15.3 in one experiment).</p></li><li><p>In short, the key challenges in fine-tuning pre-trained full-precision LLMs into 1.58-bit BitNet are <strong>performance degradation</strong>, <strong>poor scalability</strong>, and <strong>training instability</strong>.</p></li></ul><h3>The Solution Suggested</h3><p>The proposed solution is <strong>BitNet Distillation (BitDistill)</strong>, a tailored distillation framework and a scaling-friendly Quantization-Aware Training (QAT) framework.</p><ul><li><p>BitDistill is a <strong>lightweight pipeline</strong> designed to fine-tune off-the-shelf full-precision LLMs (such as Qwen) into <strong>1.58-bit precision</strong> (ternary weights ${-1, 0, 1}$) for specific downstream tasks.</p></li><li><p>The goal of BitDistill is to achieve strong task-specific performance at minimal computational cost, bridging the gap between extreme 1.58-bit quantization and practical deployment.</p></li><li><p>BitDistill allows 1.58-bit quantized LLMs to achieve downstream performance comparable to their full-precision counterparts.</p></li></ul><h3>How the Solution Was Achieved</h3><p>BitDistill is a three-stage training pipeline designed to mitigate the identified challenges of instability and poor scalability:</p><h4>Stage 1: Modeling Refinement</h4><p>This stage addresses optimization instability caused by excessively large activation variance in low-bit quantized models.</p><ul><li><p>The architecture of the LLM is modified by integrating the <strong>SubLN module</strong>.</p></li><li><p>SubLN layers are inserted inside each transformer block, right before the output projection of the Multi-Head Self-Attention (MHSA) module and before the output projection of the Feed-Forward Network
(FFN).</p></li><li><p>This design ensures that hidden representations entering quantized projection layers are variance-stabilized, preventing activation scale explosion and improving training stability and task performance.</p></li></ul><h4>Stage 2: Continued Pre-Training (Crucial Warm-up Step)</h4><p>This stage mitigates the poor scalability issue that arises when directly fine-tuning full-precision weights to 1.58-bit representations using limited downstream training tokens.</p><ul><li><p>A small amount of pretraining corpus (e.g., only 10B tokens sampled from the FALCON corpus) is used to fine-tune the modeling-modified LLMs from Stage 1. This cost is virtually negligible compared to pre-training 1.58-bit BitNet from scratch (approximately 4T tokens).</p></li><li><p>This warm-up step enables the BitNet models to rapidly adapt to a feature space better suited for 1.58-bit optimization, making the weight distribution more similar to BitNet trained from scratch (concentrating weights near transition boundaries), thus improving downstream performance and preventing convergence to suboptimal local minima.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-zqU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-zqU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 424w, 
https://substackcdn.com/image/fetch/$s_!-zqU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 848w, https://substackcdn.com/image/fetch/$s_!-zqU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 1272w, https://substackcdn.com/image/fetch/$s_!-zqU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-zqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png" width="1146" height="628" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:628,&quot;width&quot;:1146,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222314,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/176410792?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!-zqU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 424w, https://substackcdn.com/image/fetch/$s_!-zqU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 848w, https://substackcdn.com/image/fetch/$s_!-zqU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 1272w, https://substackcdn.com/image/fetch/$s_!-zqU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92e958d8-5059-41fe-bb62-2e9f47a2813d_1146x628.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Stage 3: Distillation-based Fine-tuning</h4><p>This stage recovers the accuracy of the full-precision teacher model by incorporating two types of knowledge distillation during downstream fine-tuning.</p><ul><li><p><strong>Logits Distillation (LD):</strong> This technique minimizes the Kullback&#8211;Leibler divergence between the output distributions (softmax over the logits) of the full-precision FP16 teacher and the 1.58-bit student.</p></li><li><p><strong>Multi-Head Attention Distillation (AD):</strong> Based on the MiniLM series, this encourages the 1.58-bit student to capture the fine-grained structural dependencies embedded in the FP16 teacher&#8217;s attention patterns.
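</p><p>Under illustrative assumptions (the temperature and the two loss weights below are placeholders, not values from the paper), the Stage 3 objective combining cross-entropy with both distillation terms can be sketched as:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-9):
    """Mean KL(p || q) between rows of two distribution arrays."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return (p * np.log(p / q)).sum(axis=-1).mean()

def stage3_loss(student_logits, teacher_logits,
                student_attn, teacher_attn, labels,
                temperature=2.0, lam_ld=1.0, lam_ad=1.0):
    """Cross-entropy + logits distillation (temperature-softened KL)
    + attention-map distillation at a single matched layer."""
    probs = softmax(student_logits)
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-9).mean()
    ld = kl_div(softmax(teacher_logits / temperature),
                softmax(student_logits / temperature)) * temperature ** 2
    ad = kl_div(teacher_attn, student_attn)
    return ce + lam_ld * ld + lam_ad * ad
```

<p>Here <code>student_attn</code> and <code>teacher_attn</code> are attention distributions taken from one matched teacher&#8211;student layer pair.</p><p>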
This distillation is typically performed only at a single layer rather than across all layers to provide greater optimization flexibility.</p></li></ul><p>The total loss for Stage 3 combines the standard cross-entropy loss ($L_{CE}$), the logits distillation loss ($L_{LD}$), and the attention distillation loss ($L_{AD}$).</p><h3>The Results of These Solutions</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sEyh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sEyh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 424w, https://substackcdn.com/image/fetch/$s_!sEyh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 848w, https://substackcdn.com/image/fetch/$s_!sEyh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 1272w, https://substackcdn.com/image/fetch/$s_!sEyh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sEyh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png" width="1091" height="730" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1091,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:201747,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/176410792?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!sEyh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 424w, https://substackcdn.com/image/fetch/$s_!sEyh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 848w, https://substackcdn.com/image/fetch/$s_!sEyh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 1272w, https://substackcdn.com/image/fetch/$s_!sEyh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff452ceb0-a85a-4b42-9669-02086cbf4c3a_1091x730.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>BitDistill achieved significant efficiency gains while maintaining high performance:</p><ul><li><p><strong>Performance:</strong> BitDistill achieved <strong>performance comparable to full-precision counterparts</strong> across model size and benchmarks (classification and summarization), with only marginal differences observed in most cases.</p><ul><li><p>For example, on the 4B model classification task (MNLI), FP16-SFT achieved 91.48 accuracy, while BitDistill achieved 91.40.</p></li><li><p>BitDistill models consistently achieved performance close to full-precision fine-tuning across alternative backbones tested (Gemma and Qwen2.5).</p></li><li><p>The framework can effectively leverage a higher-quality FP16 teacher (e.g., distilling a 0.6B student with a 4B teacher) to achieve greater downstream task gains, sometimes 
even surpassing FP16 models of the same size.</p></li></ul></li><li><p><strong>Memory Efficiency:</strong> BitDistill enabled up to <strong>10&#215; memory savings</strong>. For instance, a 0.6B Qwen3 model reduced memory usage from 1.20 GB (FP16-SFT) to 0.11 GB (BitDistill).</p></li><li><p><strong>Inference Speed:</strong> BitDistill enabled about <strong>2.65&#215; faster inference on CPUs</strong> than the FP16 baseline; for example, inference speed increased from 427 tokens/s (FP16-SFT) to 1,135 tokens/s (BitDistill).</p></li><li><p><strong>Scalability:</strong> Unlike direct fine-tuning (BitNet-SFT), BitDistill <strong>preserves scalability</strong>, allowing performance to remain comparable to full-precision counterparts across all model sizes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!olUg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!olUg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 424w, https://substackcdn.com/image/fetch/$s_!olUg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 848w, https://substackcdn.com/image/fetch/$s_!olUg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 1272w,
https://substackcdn.com/image/fetch/$s_!olUg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!olUg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png" width="1136" height="354" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:354,&quot;width&quot;:1136,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134934,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/176410792?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!olUg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 424w, https://substackcdn.com/image/fetch/$s_!olUg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 848w, 
https://substackcdn.com/image/fetch/$s_!olUg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 1272w, https://substackcdn.com/image/fetch/$s_!olUg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e517e0c-2f17-4988-854a-077adf37a90c_1136x354.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li><li><p><strong>Ablation Studies:</strong> All three stages (Modeling Refinement, Continued Pre-Training, and Distillation-based
Fine-tuning) were found to be complementary, as excluding any one stage led to a non-trivial drop in downstream performance. The combination of Logits Distillation and Multi-Head Attention Distillation provided the most consistent performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I1Fn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I1Fn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 424w, https://substackcdn.com/image/fetch/$s_!I1Fn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 848w, https://substackcdn.com/image/fetch/$s_!I1Fn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 1272w, https://substackcdn.com/image/fetch/$s_!I1Fn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I1Fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png" width="472" height="262.04137931034484" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:322,&quot;width&quot;:580,&quot;resizeWidth&quot;:472,&quot;bytes&quot;:82840,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/176410792?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a546bd-8cd9-4228-a066-0fbe390d82e8_1129x322.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I1Fn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 424w, https://substackcdn.com/image/fetch/$s_!I1Fn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 848w, https://substackcdn.com/image/fetch/$s_!I1Fn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 1272w, https://substackcdn.com/image/fetch/$s_!I1Fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17201ea-c815-4977-b49c-27bd9c5889c5_580x322.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>Compatibility with Quantization Techniques:</strong> Further investigation demonstrated that BitDistill is <strong>complementary</strong> to and compatible with various existing post-training and weight-quantization methods (such as Block-Quant, GPTQ, and AWQ). 
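</p><p>For intuition on how weight quantization composes with BitNet-style arithmetic, here is a minimal sketch of the absmax int8 activation quantization commonly paired with ternary weights (an illustrative, standard formulation rather than code from the paper):</p>

```python
import numpy as np

def absmax_quantize_int8(x: np.ndarray, eps: float = 1e-6):
    """Per-row absmax quantization of activations into the int8
    range [-128, 127], as commonly paired with ternary weights
    in BitNet-style linear layers."""
    scale = 127.0 / (np.abs(x).max(axis=-1, keepdims=True) + eps)
    x_q = np.clip(np.round(x * scale), -128, 127)
    return x_q, scale                 # dequantize as x_q / scale
```

<p>A ternary-weight matmul then runs on the quantized activations, and the output is rescaled using the weight and activation scales.</p><p>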
Models consistently benefited from the proposed framework regardless of the underlying quantization method, often matching the full-precision baseline.</p></li></ul><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[MOSS-Speech]]></title><description><![CDATA[Hidden State Divergence: The Imperative for Architectural Split Preserving Textual Integrity in Speech LLMs]]></description><link>https://sankar1535.substack.com/p/moss-speech</link><guid isPermaLink="false">https://sankar1535.substack.com/p/moss-speech</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Tue, 07 Oct 2025 07:43:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ae5U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hLLH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hLLH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 424w, https://substackcdn.com/image/fetch/$s_!hLLH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 848w, 
https://substackcdn.com/image/fetch/$s_!hLLH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 1272w, https://substackcdn.com/image/fetch/$s_!hLLH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hLLH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png" width="971" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:971,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120358,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/175505313?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hLLH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 424w, https://substackcdn.com/image/fetch/$s_!hLLH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 
848w, https://substackcdn.com/image/fetch/$s_!hLLH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 1272w, https://substackcdn.com/image/fetch/$s_!hLLH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd30a376-2bca-4654-90f8-22f8e43062c5_971x425.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>&#129504; Spoken dialogue systems have advanced toward end-to-end designs, yet most still rely on intermediate text
guidance, creating bottlenecks in latency and expressivity.</p><p>&#9889; Preliminary studies reveal that hidden-state alignment between speech and text deteriorates and diverges in deeper layers of a joint Transformer backbone. Why should a single shared architecture handle modality-specific generation in its final blocks? &#129300; This divergence risks degrading the pretrained text LLM&#8217;s reasoning and knowledge when new speech capabilities are added.</p><p>&#128640; The core innovation, MOSS-Speech, overcomes this with a modality-based layer-splitting design and a frozen pre-training strategy &#10052;&#65039;.</p><p>The model routes the shared hidden state &#8212; after joint fusion in the early layers &#8212; into separate modality-specific branches (text &#128221; and speech &#128266;) starting at the 32nd of 36 Transformer blocks.</p><p>This design solves cross-modal misalignment in deep layers by reserving the final layers for modality-specific generation, enabling deep fusion early while preventing divergence near the output &#128260;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ae5U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ae5U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 424w, https://substackcdn.com/image/fetch/$s_!ae5U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 848w,
https://substackcdn.com/image/fetch/$s_!ae5U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 1272w, https://substackcdn.com/image/fetch/$s_!ae5U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ae5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png" width="829" height="698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:829,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ae5U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 424w, https://substackcdn.com/image/fetch/$s_!ae5U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 848w, 
https://substackcdn.com/image/fetch/$s_!ae5U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 1272w, https://substackcdn.com/image/fetch/$s_!ae5U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F852e178a-c4af-4a97-bf5e-18f7ae35448a_829x698.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>Critically, this approach, with frozen text parameters, mitigates reasoning and knowledge degradation typically seen when extending LLMs to new
modalities &#129513;.</p><p>Because the model is fully streaming, the encoder&#8217;s WER of 10.80% is slightly higher than the 9.17% of the block-causal GLM-4-Voice. The decoder, though achieving better intelligibility and quality &#127775;, shows a marginal trade-off in speaker similarity on English and Chinese benchmarks versus CosyVoice 2.</p><p>The model was initialized from the Qwen-3-8B backbone and trained in two stages on ~4M hours of speech data &#9201;&#65039; &#8212; including 690k hours of English and 952k hours of Chinese interleaved speech&#8211;text &#8212; followed by two epochs of supervised fine-tuning on 1.5M synthetic multimodal QA pairs &#127911;&#128218;.</p><p>The pre-trained model preserved textual ability and improved speech modeling, scoring 69.53 on Chinese spoken StoryCloze versus 54.39 for GLM-4-Voice &#128200;.</p><p>In supervised fine-tuning, it achieved SOTA spoken QA (77.33 on LLaMA-QA) and high perceived speech quality (4.37 UTMOS) &#127942;.</p><p>This architecture enables reasoning transfer to speech while paving the way for speech-native models with seamless, expressive human&#8211;AI interaction &#129309;&#128172;.</p><p>&#128279; More:<a href="http://arxiv.org/abs/2510.00499"> arxiv.org/abs/2510.00499</a></p>]]></content:encoded></item><item><title><![CDATA[Revolutionizing Text-to-Speech: How Differentiable Reward Optimization is Changing the Game]]></title><description><![CDATA[This paper presents a Differentiable Reward Optimization method to enhance the performance of neural codec language model-based text-to-speech systems by directly predicting rewards from neural codec]]></description><link>https://sankar1535.substack.com/p/revolutionizing-text-to-speech-how</link><guid isPermaLink="false">https://sankar1535.substack.com/p/revolutionizing-text-to-speech-how</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Fri, 11 Jul 2025 03:12:15 GMT</pubDate><enclosure
url="https://substackcdn.com/image/fetch/$s_!H-NT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h6><strong><a href="https://arxiv.org/abs/2507.05911">Paper</a></strong></h6><p>The world of text-to-speech (TTS) synthesis has been transformed by large language models, but training these systems to produce high-quality, controllable speech has remained a significant challenge. Traditional reinforcement learning from human feedback (RLHF) approaches for TTS are computationally expensive and complex. However, groundbreaking research from Alibaba's Tongyi Lab introduces a novel solution: <strong>Differentiable Reward Optimization (DiffRO)</strong>.</p><h2>The Problem with Traditional TTS Training</h2><p>Current neural codec language model-based TTS systems face several hurdles when implementing RLHF:</p><p><strong>Computational Overhead</strong>: Unlike text-based language models, TTS systems require additional backend flow matching and vocoder models to convert discrete tokens into audio. 
This creates a massive computational burden when generating training data.</p><p><strong>Limited Sample Diversity</strong>: Generated TTS samples often exhibit high similarity, making it difficult to distinguish between positive and negative examples for reward model training.</p><p><strong>Complex Evaluation</strong>: TTS quality depends on multiple factors&#8212;pronunciation accuracy, naturalness, speaker similarity, and emotional expression&#8212;making simple binary classification insufficient.</p><h2>Introducing DiffRO: A Paradigm Shift</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H-NT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H-NT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 424w, https://substackcdn.com/image/fetch/$s_!H-NT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 848w, https://substackcdn.com/image/fetch/$s_!H-NT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 1272w, https://substackcdn.com/image/fetch/$s_!H-NT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!H-NT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png" width="1456" height="647" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:647,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H-NT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 424w, https://substackcdn.com/image/fetch/$s_!H-NT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 848w, https://substackcdn.com/image/fetch/$s_!H-NT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 1272w, https://substackcdn.com/image/fetch/$s_!H-NT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69e09fac-993e-425f-8fb9-745aa8fcf031_1600x711.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>DiffRO addresses these challenges through three key innovations:</p><h3>1. Token-Level Reward Prediction</h3><p>Instead of synthesizing audio to evaluate quality, DiffRO predicts rewards directly from neural codec tokens. This approach leverages the fact that codec tokens should contain all necessary information from the input text. By using an ASR-style approach to predict the original text from tokens, the system can assess pronunciation accuracy without expensive audio synthesis.</p><h3>2. Differentiable Training Process</h3><p>Traditional RLHF requires complex reinforcement learning loops. DiffRO uses the Gumbel-Softmax technique to make the reward function differentiable, enabling direct optimization through standard backpropagation. 
This eliminates the need for Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO) strategies.</p><h3>3. Multi-Task Reward (MTR) Model</h3><p>Rather than focusing solely on pronunciation, DiffRO employs a multi-task reward model that simultaneously handles:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dXCs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dXCs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 424w, https://substackcdn.com/image/fetch/$s_!dXCs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 848w, https://substackcdn.com/image/fetch/$s_!dXCs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 1272w, https://substackcdn.com/image/fetch/$s_!dXCs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dXCs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png" width="434" height="370.94017094017096" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:936,&quot;resizeWidth&quot;:434,&quot;bytes&quot;:145429,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/168045481?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dXCs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 424w, https://substackcdn.com/image/fetch/$s_!dXCs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 848w, https://substackcdn.com/image/fetch/$s_!dXCs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 1272w, https://substackcdn.com/image/fetch/$s_!dXCs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b63179a-9f3f-4c3e-81a5-3a7e5f24a7e6_936x800.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p>Automatic Speech Recognition (ASR) for pronunciation accuracy</p></li><li><p>Speech Emotion Recognition (SER) for emotional expression</p></li><li><p>Speech Quality Assessment (SQA) for overall audio quality</p></li><li><p>Age and Gender prediction for speaker characteristics</p></li></ul><h2>Impressive Results Across Multiple Dimensions</h2><p>The researchers tested DiffRO on the challenging seed-tts-eval benchmark, demonstrating significant improvements across multiple languages and evaluation criteria:</p><p><strong>Pronunciation Accuracy</strong>: The system achieved state-of-the-art word error rates across Chinese, English, and challenging text cases, with particularly impressive results on hard-to-pronounce content.</p><p><strong>Cross-Lingual Performance</strong>: Remarkably, DiffRO improved performance on Japanese and Korean 
without language-specific training data, suggesting the method helps models learn universal pronunciation principles.</p><p><strong>Emotional Control</strong>: The multi-task approach enabled sophisticated emotional expression control, with the system achieving high accuracy across Happy, Sad, and Angry categories. Most impressively, the model learned to generate natural audio events like laughter and breathing sounds without explicit training.</p><p><strong>Competitive Performance</strong>: DiffRO outperformed existing models including F5-TTS and GPT-SoVITS on emotion benchmarks, demonstrating its practical superiority.</p><h2>Beyond the Technical Achievement</h2><p>What makes DiffRO particularly exciting is its potential to democratize high-quality TTS development. By reducing computational requirements and simplifying the training process, this approach makes advanced speech synthesis more accessible to researchers and developers with limited resources.</p><p>The method's ability to control multiple speech attributes simultaneously&#8212;from pronunciation accuracy to emotional expression&#8212;opens new possibilities for personalized AI assistants, content creation tools, and accessibility applications.</p><h2>Looking Forward</h2><p>While DiffRO represents a significant leap forward, the researchers acknowledge areas for future development. The current approach primarily optimizes the language model component, while speaker-related attributes are more heavily influenced by the flow matching and vocoder stages. Future work aims to extend DiffRO to these components for even more comprehensive control.</p><p>The success of DiffRO suggests we're entering a new era of efficient, controllable speech synthesis. 
As the demand for natural, emotionally aware AI interactions continues to grow, methods like DiffRO will be crucial for building the next generation of voice-enabled applications.</p><div><hr></div><p><em>This research represents a fundamental shift in how we approach TTS optimization, moving from expensive, complex training procedures to efficient, direct optimization methods. The implications extend far beyond technical improvements&#8212;they point toward a future where high-quality, controllable speech synthesis becomes a standard capability rather than a specialized achievement.</em></p>]]></content:encoded></item><item><title><![CDATA[Anthropic’s Model Context Protocol (MCP)]]></title><description><![CDATA[&#129521; LLMs are powerful&#8212;but brittle.]]></description><link>https://sankar1535.substack.com/p/anthropics-modular-code-protocol</link><guid isPermaLink="false">https://sankar1535.substack.com/p/anthropics-modular-code-protocol</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Sun, 01 Jun 2025 07:10:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EQKe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>&#129521; <strong>LLMs are powerful&#8212;but brittle. Anthropic&#8217;s Model Context Protocol (MCP) introduces a clean way to build AI agents that are safe, modular, and production-ready.</strong></p><p>Most agent systems today tangle prompts, tool logic, and data access.
MCP offers a better blueprint by <strong>separating model reasoning, application execution, and user intent</strong> into three distinct, composable layers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EQKe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EQKe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 424w, https://substackcdn.com/image/fetch/$s_!EQKe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 848w, https://substackcdn.com/image/fetch/$s_!EQKe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 1272w, https://substackcdn.com/image/fetch/$s_!EQKe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EQKe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png" width="1456" height="780" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:780,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EQKe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 424w, https://substackcdn.com/image/fetch/$s_!EQKe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 848w, https://substackcdn.com/image/fetch/$s_!EQKe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 1272w, https://substackcdn.com/image/fetch/$s_!EQKe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44797dd9-5de2-4d67-b41f-cc1d60f8d399_1600x857.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><p>&#128295; <strong>Tools</strong> (<em>Model-controlled</em>)<br> Functions the model can <em>invoke</em>, not generate &#8212; e.g.<br> &#8226; create_invoice()<br> &#8226; reply_to_customer()<br> &#8226; flag_transaction()<br> Tools give the model agency through well-defined actions.</p><p>&#128194; <strong>Resources</strong> (<em>Application-controlled</em>)<br> Scoped data made available to the model, such as:<br> &#8226; CRM records<br> &#8226; Order history<br> &#8226; Account ledgers or transaction logs<br> Resources are visible to the model but immutable &#8212; keeping data access safe and auditable.</p><p>&#129504; <strong>Prompts</strong> (<em>User-controlled</em>)<br> Reusable, structured templates that describe what the model should <em>do</em>, like:<br> &#8226; &#8220;Summarize a customer complaint&#8221;<br> &#8226; &#8220;Extract payment info into JSON&#8221;<br> &#8226; &#8220;Suggest an action plan for overdue 
invoices&#8221;</p><div><hr></div><p>&#9881;&#65039; <strong>The MCP design is client-server:</strong></p><ul><li><p>The <strong>MCP Server</strong> exposes tools, resources, and prompt templates.</p></li><li><p>The <strong>MCP Client</strong> assembles these at runtime, calling the model as needed.</p></li></ul><p>&#128161; <em>Customer Support Example:</em> Summarize the last 5 support tickets (prompt), read a customer's complaint history (resource), and respond via Zendesk (tool).<br> &#128161; <em>Finance Agent Example:</em> Analyze quarterly reports (resource), extract KPIs (prompt), and generate an invoice summary for review (tool).</p><div><hr></div><p>&#9989; <strong>Why MCP Matters for Real-World Agents<br></strong> &#8226; <strong>Modularity</strong> &#8211; Swap in tools or data views without rewriting the whole system<br> &#8226; <strong>Governance</strong> &#8211; Models never get raw access to databases or APIs<br> &#8226; <strong>Safety</strong> &#8211; Logic lives in code, not fragile prompt hacks<br> &#8226; <strong>Observability</strong> &#8211; Each step (data &#8594; decision &#8594; action) is transparent</p><div><hr></div><p>This model is language-agnostic and future-proof. 
LLMs become reasoning engines, while execution and intent stay in human hands.</p><p>MCP is ideal for teams building <strong>internal copilots</strong>, <strong>compliance-aware finance agents</strong>, or <strong>customer support assistants</strong> that need traceability, control, and adaptability.</p><p>&#128204; If you&#8217;re building with OpenAI, Claude, or local LLMs, this is a pattern worth adopting early.</p><p>More details: </p><div id="youtube2-kQmXtrmQ5Zg" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;kQmXtrmQ5Zg&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/kQmXtrmQ5Zg?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div>]]></content:encoded></item><item><title><![CDATA[A Glimpse into the First Duplex S2S Model Without Speech Pre-training]]></title><description><![CDATA[Interacting with AI through speech can feel unnatural due to delays and rigid turn-taking.]]></description><link>https://sankar1535.substack.com/p/a-glimpse-into-the-first-duplex-s2s</link><guid isPermaLink="false">https://sankar1535.substack.com/p/a-glimpse-into-the-first-duplex-s2s</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Wed, 28 May 2025 12:06:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WPja!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Interacting with AI through speech can feel unnatural due to delays and rigid turn-taking.</p><p>&#127897;&#65039; <strong>NVIDIA researchers</strong> have taken a significant step toward more fluid, real-time spoken 
dialogue with a novel <strong>duplex speech-to-speech (S2S) model</strong> designed for seamless interaction &#8212; including <strong>robust user barge-in</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WPja!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WPja!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 424w, https://substackcdn.com/image/fetch/$s_!WPja!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 848w, https://substackcdn.com/image/fetch/$s_!WPja!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 1272w, https://substackcdn.com/image/fetch/$s_!WPja!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WPja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png" width="1456" height="595" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WPja!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 424w, https://substackcdn.com/image/fetch/$s_!WPja!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 848w, https://substackcdn.com/image/fetch/$s_!WPja!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 1272w, https://substackcdn.com/image/fetch/$s_!WPja!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58568586-6d6f-48d8-97a8-ed35abde506c_1600x654.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128269; <strong>What&#8217;s new:<br></strong> Introducing a <strong>novel duplex S2S architecture</strong> that directly models <strong>simultaneous user and agent streams</strong> using <strong>channel fusion</strong>.<br> &#10024; It's the <strong>first duplex S2S model</strong> demonstrated <em>without requiring speech pretraining</em>, simplifying the development process from any Large Language Model (LLM).<br> &#128230; To support reproducibility, this is also the <strong>first openly available duplex S2S model</strong> with full <strong>training and inference code</strong>.</p><p></p><p>&#128736;&#65039; <strong>How it works:</strong></p><ul><li><p>Simultaneously takes continuous <strong>user and agent speech/text streams</strong>.</p></li><li><p>Uses <strong>channel fusion</strong> to model concurrent inputs.</p></li><li><p>User speech is processed by a <strong>pre-trained streaming encoder</strong>.</p></li><li><p>Agent output is generated using a 
<strong>codec</strong> (default: personalised <strong>NanoCodec</strong>).</p></li><li><p>Trained via <strong>multi-channel next token prediction</strong>, handling text and speech in parallel with turn-level alignments.</p></li><li><p>Leverages a <strong>causal speech encoder</strong> and <strong>text LLM</strong> for streaming capabilities.<br><br></p></li></ul><p>&#128200; <strong>Results:</strong></p><ul><li><p>Outperforms previous duplex models like <em>Moshi</em>.</p></li><li><p>&#128483;&#65039; <strong>Barge-in success rate:</strong> 94.5% vs. 55.1%</p></li><li><p>&#9201;&#65039; <strong>Barge-in latency:</strong> 0.69s vs. 0.81s</p></li><li><p>&#128266; <strong>Higher speech quality:</strong> greater than <strong>4.0 UTMOS</strong></p></li><li><p>&#129504; <strong>Better reasoning scores</strong> across evaluation datasets</p></li><li><p>&#9889; <strong>Lower bitrate:</strong> 0.6 kbps &#8212; <em>half</em> of some prior work</p></li><li><p>&#128161; <strong>Codec personalization</strong> boosts performance, enabling it to beat higher-bitrate codecs on reconstruction metrics.<br><br></p></li></ul><p>&#127757; <strong>Why it matters:<br></strong>This research tackles key limitations of traditional turn-based systems, enabling <strong>natural, responsive, and efficient</strong> human-computer dialogue.</p><p>By removing the need for extensive speech pre-training, it <strong>democratises real-time S2S development</strong>, making it easier to adapt a variety of LLMs for spoken conversations.</p><p>With <strong>open-source availability</strong>, <strong>lower resource requirements</strong>, and <strong>state-of-the-art conversational performance</strong>, this marks a major milestone in the journey to next-gen, highly interactive AI agents.</p><p>&#128207; The paper also introduces <strong>systematic metrics</strong> for evaluating conversational behaviours like <strong>turn-taking</strong> and <strong>barge-in</strong>, pushing the field
forward.</p><p>More Details: <a href="https://arxiv.org/abs/2505.15670">https://arxiv.org/abs/2505.15670</a></p>]]></content:encoded></item><item><title><![CDATA[Architecting Real-time Speech Interaction: Step-Audio's Approach to Seamless Tool Calling]]></title><description><![CDATA[Building truly intelligent real-time speech interaction systems presents significant technical hurdles.]]></description><link>https://sankar1535.substack.com/p/architecting-real-time-speech-interaction</link><guid isPermaLink="false">https://sankar1535.substack.com/p/architecting-real-time-speech-interaction</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Wed, 21 May 2025 14:20:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Eu9U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Building truly intelligent real-time speech interaction systems presents significant technical hurdles. A key challenge arises when the system needs to access external information or services &#8211; performing "tool calls" or "API integrations". 
Traditionally, integrating these steps into a cascading ASR-LLM-TTS pipeline could introduce latency, requiring the system to wait for API responses before generating the final spoken output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Eu9U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Eu9U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 424w, https://substackcdn.com/image/fetch/$s_!Eu9U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 848w, https://substackcdn.com/image/fetch/$s_!Eu9U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 1272w, https://substackcdn.com/image/fetch/$s_!Eu9U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Eu9U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png" width="741" height="420" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:420,&quot;width&quot;:741,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Eu9U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 424w, https://substackcdn.com/image/fetch/$s_!Eu9U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 848w, https://substackcdn.com/image/fetch/$s_!Eu9U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 1272w, https://substackcdn.com/image/fetch/$s_!Eu9U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9e1176e-e3ff-40ea-96ab-ffc7acc6914a_741x420.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Step-Audio, the first production-ready open-source framework for intelligent speech interaction, addresses this with an innovative <strong>asynchronous tool invocation</strong> mechanism. Rather than a simple sequential process, the architecture specifically <strong>decouples the text-based tool processing from the audio generation pipelines</strong>.</p><h3>Why is this decoupling significant?</h3><p>Real-time text responses and their corresponding audio streams have a substantial bitrate disparity. By separating the processes, Step-Audio allows for the <strong>parallel execution of external service queries</strong> (like knowledge retrieval) and speech synthesis.</p><h3>The result?</h3><p>This design <strong>eliminates waiting time for audio rendering when a tool call is required</strong>, significantly <strong>enhancing interaction fluidity</strong>. 
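</p><p>The decoupling can be illustrated with a small asyncio sketch. This is only an illustration of the idea, not Step-Audio's actual implementation; the function names and timings are invented. The point is that the external query is launched first and speech synthesis proceeds in parallel, instead of blocking on the API response.</p>

```python
# Illustrative asyncio sketch of decoupling tool invocation from audio
# generation (not Step-Audio's actual code; all names/timings invented).
import asyncio

async def query_tool(question: str) -> str:
    await asyncio.sleep(0.3)            # stands in for an external API call
    return f"result for {question!r}"

async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0.1)            # stands in for TTS rendering
    return text.encode()

async def respond(question: str) -> bytes:
    # Launch the external service query immediately...
    tool_task = asyncio.create_task(query_tool(question))
    # ...and start speaking a lead-in while it runs, instead of waiting.
    filler_audio = await synthesize("Let me check that for you.")
    answer = await tool_task            # ready (or nearly so) by now
    return filler_audio + await synthesize(answer)

audio = asyncio.run(respond("today's weather"))
```

<p>Sequential execution would cost the tool delay plus both synthesis delays end to end; overlapping them hides most of the tool latency behind speech that is already playing, which is exactly the fluidity gain described above.</p><p>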
The system manages complex tasks by augmenting its cognitive architecture with this crucial tool calling ability.</p><p>This architectural choice is a technical leap forward, enabling smoother, more natural conversational dynamics even when the system needs to look up information or interact with external systems.</p><p>More Details: <a href="https://arxiv.org/abs/2502.11946">https://arxiv.org/abs/2502.11946</a></p><p></p>]]></content:encoded></item><item><title><![CDATA[VITA-Audio: Real-Time Speech Generation, Redefined]]></title><description><![CDATA[Introducing VITA-Audio, a groundbreaking end-to-end speech model that dramatically reduces latency in real-time speech applications.]]></description><link>https://sankar1535.substack.com/p/vita-audio-real-time-speech-generation</link><guid isPermaLink="false">https://sankar1535.substack.com/p/vita-audio-real-time-speech-generation</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Wed, 14 May 2025 13:31:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ll8y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Introducing <strong>VITA-Audio</strong>, a groundbreaking end-to-end speech model that dramatically reduces latency in real-time speech applications.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ll8y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Ll8y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 424w, https://substackcdn.com/image/fetch/$s_!Ll8y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 848w, https://substackcdn.com/image/fetch/$s_!Ll8y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!Ll8y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ll8y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png" width="1396" height="1292" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1292,&quot;width&quot;:1396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Ll8y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 424w, https://substackcdn.com/image/fetch/$s_!Ll8y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 848w, https://substackcdn.com/image/fetch/$s_!Ll8y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 1272w, https://substackcdn.com/image/fetch/$s_!Ll8y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61e4beed-eecc-4466-8f03-fdf47e9b81d0_1396x1292.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#127775; <strong>Why it matters:<br></strong>Traditional systems suffer from high first-token delay, limiting truly natural voice interactions. VITA-Audio is the <strong>first multi-modal LLM</strong> capable of generating audio output <strong>during the initial forward pass</strong> &#8212; achieving <strong>zero audio-token generation delay</strong>.</p><p>&#128736;&#65039; <strong>How it works:</strong></p><ul><li><p>The core innovation is the <strong>Multiple Cross-modal Token Prediction (MCTP)</strong> module.</p></li><li><p>The MCTP modules were designed based on attention map visualizations and masking experiments, which revealed that LLM hidden states contain enough localized context for audio token generation.</p></li><li><p>These findings showed that the speech model focuses mainly on nearby text hidden states rather than the full semantic context, enabling the use of lightweight modules for audio prediction.</p></li><li><p>Inspired by the isomorphic Multi-Token Prediction (MTP) framework used in models like DeepSeek V3, <strong>MCTP</strong> modules generate audio tokens efficiently without extra LLM passes.</p></li><li><p>End-to-end architecture with audio encoder, LLM backbone, and audio decoder.</p></li><li><p>Interleaved text-audio token modeling retains language quality.</p></li><li><p>Progressive four-stage training ensures accurate speech synthesis.</p></li></ul><p>&#9889; <strong>Results:</strong></p><ul><li><p>Up to <strong>5x faster inference</strong> (Turbo mode).</p></li><li><p>First audio chunk latency slashed from <strong>236ms to 53ms.</strong></p></li><li><p><strong>State-of-the-art
results</strong> on speech-to-speech tasks and top-tier TTS quality.</p></li><li><p>Fully open-source and reproducible.</p></li></ul><p>&#128266; <strong>Why it&#8217;s a game changer:<br></strong> VITA-Audio unlocks real-time speech-to-speech interaction for assistants, agents, and conversational AI &#8212; setting a <strong>new standard</strong> for latency-sensitive applications.</p><p>&#128194; More details: <a href="https://github.com/VITA-MLLM/VITA-Audio">https://github.com/VITA-MLLM/VITA-Audio</a></p>]]></content:encoded></item><item><title><![CDATA[How Voila Speaks with Nuance: The Power of Structured Interleaved Alignment]]></title><description><![CDATA[Voila is a new family of voice-language foundation models.]]></description><link>https://sankar1535.substack.com/p/how-voila-speaks-with-nuance-the</link><guid isPermaLink="false">https://sankar1535.substack.com/p/how-voila-speaks-with-nuance-the</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Mon, 12 May 2025 13:30:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9O7v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Voila is a new family of voice-language foundation models.
At the heart of Voila&#8217;s ability to generate <em>expressive and synchronized speech</em> lies a key innovation: <strong>Structured Interleaved Alignment</strong> &#8212; a smarter way to connect <em>what&#8217;s said</em> with <em>how it sounds</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9O7v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9O7v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 424w, https://substackcdn.com/image/fetch/$s_!9O7v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 848w, https://substackcdn.com/image/fetch/$s_!9O7v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 1272w, https://substackcdn.com/image/fetch/$s_!9O7v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9O7v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png" width="1292" height="1284" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1284,&quot;width&quot;:1292,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9O7v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 424w, https://substackcdn.com/image/fetch/$s_!9O7v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 848w, https://substackcdn.com/image/fetch/$s_!9O7v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 1272w, https://substackcdn.com/image/fetch/$s_!9O7v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e01e2f7-60e0-4c9e-8b03-b076cf697317_1292x1284.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To enable a large language model (LLM) to understand and generate voice, Voila first transforms raw audio into <strong>discrete semantic and acoustic tokens</strong> using its neural audio codec, the Voila-Tokenizer. 
These tokens become part of the LLM's vocabulary &#8212; making audio a <em>first-class citizen</em> in language understanding and generation.</p><p>But here&#8217;s the twist &#128071;</p><p>Rather than batching all the text followed by all the audio (as many models do), Voila interleaves them in a <strong>structured, fine-grained sequence</strong>:<br> &#10145;&#65039; "&lt;Hello&gt;" + &lt;audio tokens&gt;<br> &#10145;&#65039; "&lt;I&gt;" + &lt;audio tokens&gt;<br> &#10145;&#65039; "&lt;am&gt;" + &lt;audio tokens&gt;<br> &#10145;&#65039; "&lt;Voila&gt;" + &lt;audio tokens&gt;</p><p>This <strong>tight pairing</strong> of each semantic unit with its corresponding acoustic representation creates a <strong>clear and explicit alignment</strong> &#8212; crucial for generating nuanced speech that feels natural and well-paced. This design differs from prior approaches, such as Spirit-LM and Unified Spoken Dialog Model (USDM), which also adopt interleaved text-audio formats but do so with looser coupling.</p><p>&#10024; Why does this matter?</p><ul><li><p>It strengthens the LLM&#8217;s understanding of how meaning and sound relate.</p></li><li><p>It simplifies training by reducing the ambiguity of loose alignments.</p></li><li><p>It empowers Voila to handle TTS, voice-based instructions, and multimodal reasoning with remarkable clarity.</p></li></ul><p>Once interleaved, the tokens are embedded &#8212; with text tokens repeated to match multi-layer audio tokens &#8212; averaged, and passed through the LLM. The output is then decoded by an audio transformer and finally rendered into high-quality speech by the Voila-Tokenizer.</p><p>This alignment strategy isn&#8217;t just a technical detail &#8212; it&#8217;s a foundational innovation that unlocks more human-like interaction between language and voice.</p><p>&#128266; Welcome to a new era of audio-native AI. 
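</p><p>A toy sketch of the interleaving itself (the token IDs are invented for illustration; this is not Voila's actual tokenizer output):</p>

```python
# Toy sketch of structured interleaved alignment: each text token is
# immediately followed by the audio tokens that realize it, rather than
# all text first and all audio after. Token IDs are invented.

def interleave(text_tokens, audio_chunks):
    """Pair each semantic unit with its acoustic tokens, in order."""
    assert len(text_tokens) == len(audio_chunks)
    seq = []
    for word, chunk in zip(text_tokens, audio_chunks):
        seq.append(word)       # e.g. "<Hello>"
        seq.extend(chunk)      # its corresponding audio tokens
    return seq

words = ["<Hello>", "<I>", "<am>", "<Voila>"]
audio = [[101, 102], [103], [104, 105], [106, 107, 108]]
sequence = interleave(words, audio)
# Each word is tightly coupled to its own acoustic realization, giving the
# LLM an explicit text-audio alignment to learn from.
```

<p>The loose alternative would be the flat concatenation of all words followed by all audio tokens, which leaves the model to infer which sounds belong to which words; the tight pairing above makes that correspondence explicit in the sequence itself.</p><p>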
</p><p>More details: <a href="https://github.com/maitrix-org/Voila">https://github.com/maitrix-org/Voila</a></p>]]></content:encoded></item><item><title><![CDATA[INTP (Intelligibility Preference Speech Dataset): Making Zero-Shot TTS More Robust and Intelligible]]></title><description><![CDATA[Despite impressive progress, zero-shot Text-to-Speech (TTS) models still struggle with challenging linguistic scenarios &#8212; think tongue twisters, repeated words, code-switching, and cross-lingual synthesis.]]></description><link>https://sankar1535.substack.com/p/intp-intelligibility-preference-speech</link><guid isPermaLink="false">https://sankar1535.substack.com/p/intp-intelligibility-preference-speech</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Sun, 11 May 2025 16:12:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rvh1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Despite impressive progress, zero-shot Text-to-Speech (TTS) models still struggle with <em>challenging linguistic scenarios</em> &#8212; think tongue twisters, repeated words, code-switching, and cross-lingual synthesis.
A recent study directly tackles these weaknesses through <strong>preference alignment</strong> and a new dataset: <strong>INTP (Intelligibility Preference Speech Dataset)</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rvh1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rvh1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 424w, https://substackcdn.com/image/fetch/$s_!rvh1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 848w, https://substackcdn.com/image/fetch/$s_!rvh1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 1272w, https://substackcdn.com/image/fetch/$s_!rvh1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rvh1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png" width="1204" height="838" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1204,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rvh1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 424w, https://substackcdn.com/image/fetch/$s_!rvh1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 848w, https://substackcdn.com/image/fetch/$s_!rvh1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 1272w, https://substackcdn.com/image/fetch/$s_!rvh1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8234362-0cb8-4c8b-8715-9936a4b3d1f7_1204x838.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128269; <strong>What&#8217;s New:</strong></p><ul><li><p>Introduced <strong>INTP</strong>, a curated set of ~250K preference pairs targeting tough intelligibility cases.</p></li><li><p>Adapted <strong>Direct Preference Optimization (DPO)</strong> to align multiple TTS architectures: AR, Flow-Matching, and Masked Generative.</p></li><li><p>Preference signals were constructed using WER-based intra/inter-model comparisons and LLM-generated perturbations.<br></p></li></ul><p>&#128200; <strong>Why It Matters:</strong></p><ul><li><p>Across diverse zero-shot TTS models (e.g., ARS, F5-TTS, MaskGCT, CosyVoice 2, Ints), the aligned models achieved <strong>significant WER reductions</strong> and improved naturalness (N-CMOS), without sacrificing speaker similarity.</p></li><li><p>INTP shows <strong>weak-to-strong generalization</strong>&#8212;enhancing even strong models it wasn&#8217;t trained on.<br></p></li></ul><p>&#127744; It also demonstrated a <em>flywheel effect</em>&#8212;a scalable loop of
data and model improvement through iterative preference alignment.</p><p>&#128226; The authors will release the INTP dataset, DPO-based alignment code, and improved model checkpoints in the Amphion toolkit to support further research.</p><p>More details: <a href="https://intalign.github.io/">https://intalign.github.io/</a></p>]]></content:encoded></item><item><title><![CDATA[Ready to hear your LLM speak? Breakthrough in efficient Text-to-Speech with PEFT!]]></title><description><![CDATA[Large Language Models have redefined text-based AI, but giving them a voice without costly retraining has been a hurdle.]]></description><link>https://sankar1535.substack.com/p/ready-to-hear-your-llm-speak-breakthrough</link><guid isPermaLink="false">https://sankar1535.substack.com/p/ready-to-hear-your-llm-speak-breakthrough</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Mon, 05 May 2025 13:30:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yHOh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large Language Models have redefined text-based AI, but giving them a voice without costly retraining has been a hurdle. Traditional methods often require building speech models from scratch or computationally heavy full fine-tuning, sometimes impacting the LLM&#8217;s original text prowess.</p><p>&#127908; <strong>Introducing TTS-Llama</strong> &#8212; a breakthrough approach that brings voice to LLMs using <strong>Parameter-Efficient Fine-Tuning (PEFT)</strong> techniques like LoRA.
No need to retrain the whole model or sacrifice text performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yHOh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yHOh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 424w, https://substackcdn.com/image/fetch/$s_!yHOh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 848w, https://substackcdn.com/image/fetch/$s_!yHOh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!yHOh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yHOh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png" width="1218" height="1132" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1132,&quot;width&quot;:1218,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yHOh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 424w, https://substackcdn.com/image/fetch/$s_!yHOh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 848w, https://substackcdn.com/image/fetch/$s_!yHOh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!yHOh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F849ce86e-cfcf-4767-87d0-66dc9a35f621_1218x1132.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>&#128161; What&#8217;s New:</h2><p>With LoRA, we fine-tune just a tiny fraction of the LLM&#8217;s parameters &#8212; specifically the input embeddings, prediction head, and injected adapters &#8212; keeping the original model frozen. 
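A minimal numerical sketch of the LoRA idea just described (illustrative only; the names, shapes, and initialization here are assumptions, not TTS-Llama's implementation): the frozen weight W gains a trainable low-rank update BA, and only A and B are updated.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size, LoRA rank (r << d)

W = rng.normal(size=(d, d))          # frozen pretrained weight (never trained)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus low-rank adapter path: y = x W^T + x (B A)^T.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
trainable = A.size + B.size          # 2*d*r parameters instead of d*d
```

Because B starts at zero, the adapted layer initially matches the frozen model exactly; with d = 8 and rank r = 2 only 32 parameters are trained against the weight's 64, and at LLM scale that same ratio is what makes PEFT cheap.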
The result?<br> &#9989; State-of-the-art TTS with <strong>minimal compute<br></strong> &#9989; No compromise on the LLM&#8217;s original capabilities</p><h2>&#128295; How It Works:</h2><ol><li><p><strong>Text In &#8594;</strong> A fine-tuned Llama 3-8B generates high-level semantic tokens that capture meaning <em>and</em> prosody.</p></li><li><p><strong>Semantic Tokens &#8594; Acoustic Features:</strong> An acoustic language model (MusicGen) decodes these into fine-grained audio representations.</p></li><li><p><strong>Audio Out:</strong> A neural vocoder synthesizes the final waveform.</p></li></ol><p>This modular design is powered by a <strong>semantic tokenizer</strong> (extracting 4,096 discrete tokens) and an <strong>acoustic tokenizer</strong> (via a convolutional autoencoder with residual quantization).</p><h2>&#128202; Training:</h2><ul><li><p>&#128483; 50K hours of LibriHeavy speech + 60K synthetic sentences for text normalization.</p></li><li><p>&#129518; 1.4B trainable parameters in TTS-Llama.</p></li><li><p>&#128421; Trained on 8&#215; H100 GPUs, batch size 256, 20K steps.</p></li></ul><h2>&#128293; Results:</h2><p>TTS-Llama achieved <strong>state-of-the-art zero-shot TTS</strong>:</p><ul><li><p><strong>Human-likeness MOS:</strong> 3.07</p></li><li><p><strong>Audio quality MOS:</strong> 3.47<br> (Beating VoiceCraft&#8217;s 2.85 and 3.17)</p></li><li><p>It also excelled on text normalization tasks.</p></li></ul><h2>&#127757; Why It Matters:</h2><p>This work proves you don&#8217;t need to build speech models from scratch. By applying PEFT to existing LLMs, we can inject <em>new modalities</em> like speech efficiently and scalably.
It's a massive step toward truly <strong>multimodal AI</strong>.</p><p>&#128196; <strong>More here</strong>: <a href="https://arxiv.org/abs/2410.20336">https://arxiv.org/abs/2410.20336</a></p>]]></content:encoded></item><item><title><![CDATA[Rethinking Efficiency in State-of-the-Art Neural Codecs: Variable Bitrate Residual Vector Quantization (VRVQ)]]></title><description><![CDATA[&#128269; What&#8217;s New?]]></description><link>https://sankar1535.substack.com/p/rethinking-efficiency-in-state-of</link><guid isPermaLink="false">https://sankar1535.substack.com/p/rethinking-efficiency-in-state-of</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Fri, 02 May 2025 13:31:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4m5b!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73c1ee4-049d-4623-9003-7d1cf146709e_408x408.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>&#128269; What&#8217;s New?</strong></h1><p>The cutting edge of neural audio compression has largely settled on <strong>Residual Vector Quantization (RVQ)</strong>, showcasing impressive performance at low bitrates.</p><p>But in a world of diverse audio content, from complex music to simple silence, can a codec truly be optimal when it rigidly uses the same number of codebooks for <em>every</em> frame?</p><p>Introducing <strong>Variable Bitrate Residual Vector Quantization (VRVQ)</strong> &#8212; a novel approach that brings <em>true variable bitrate coding</em> to RVQ-VAE-based audio codecs for the first time.</p><p>Unlike conventional codecs that use a <em>fixed</em> number of codebooks per frame, <strong>VRVQ dynamically adjusts</strong> codebook usage based on signal complexity.
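As a toy illustration of that per-frame adjustment (a hedged sketch: the real VRVQ learns an importance map end-to-end, whereas here a crude frame-energy heuristic stands in for it):

```python
import numpy as np

def codebooks_per_frame(frames, n_max=8):
    """Toy allocation: give higher-energy frames more codebooks."""
    energy = np.mean(frames ** 2, axis=-1)     # per-frame energy
    imp = energy / (energy.max() + 1e-9)       # crude importance score in [0, 1]
    # Spend between 1 and n_max codebooks on each frame.
    return np.clip(np.ceil(imp * n_max), 1, n_max).astype(int)

frames = np.stack([np.zeros(160),              # silence
                   0.1 * np.ones(160),         # quiet frame
                   np.ones(160)])              # loud, complex frame
alloc = codebooks_per_frame(frames)
```

Silent and low-energy frames get a single codebook while complex frames get the full budget, which is exactly the capacity saving a fixed-rate codec leaves on the table.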
This makes audio coding more efficient, especially for simple content like silence or low-energy frames.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h-zK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h-zK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 424w, https://substackcdn.com/image/fetch/$s_!h-zK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 848w, https://substackcdn.com/image/fetch/$s_!h-zK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 1272w, https://substackcdn.com/image/fetch/$s_!h-zK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h-zK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png" width="691" height="290" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:290,&quot;width&quot;:691,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h-zK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 424w, https://substackcdn.com/image/fetch/$s_!h-zK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 848w, https://substackcdn.com/image/fetch/$s_!h-zK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 1272w, https://substackcdn.com/image/fetch/$s_!h-zK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd3a7e2c-43e6-4e51-8dc1-f9abfbe04a05_691x290.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>&#129504; <strong>How It Works</strong></h1><ul><li><p>The VRVQ approach introduces an importance map derived from an intermediate feature map of the encoder.</p></li><li><p>This importance map is used to generate a time-varying binary mask that controls which codebooks are used for each frame.</p></li><li><p>To train the importance map effectively alongside the codec, the authors developed a novel gradient-estimation method for the non-differentiable masking operation, using a smooth surrogate function.</p></li><li><p>The entire framework is jointly trained by optimizing a rate-distortion loss.</p><p></p></li></ul><h1><strong>&#128202; Results</strong></h1><ul><li><p><strong>VRVQ significantly outperforms</strong> a state-of-the-art DAC baseline in SI-SDR under many conditions.</p></li><li><p>Scalability: Increasing codebooks from 8 &#8594; 16 improved performance further and matched constant-bitrate baselines.<br></p></li></ul><h4>&#9888;&#65039; <strong>Limitation</strong>: </h4><p>At very low bitrates, the mask
overhead slightly reduces variable bitrate gains &#8212; but the tradeoff is minimal for practical use cases.</p><h1>&#128204; <strong>Why It Matters</strong></h1><p>Fixed-bitrate codecs waste capacity on trivial audio. VRVQ solves this &#8212; offering smarter, adaptive compression for neural audio systems. It's a step toward <strong>more flexible, efficient, and high-quality audio codecs</strong> grounded in modern machine learning.</p><p></p><p>&#127911; <strong>Explore the project + audio samples</strong>:<br> &#128279;<a href="https://yoongi43.github.io/VBRRVQ.github.io/"> https://yoongi43.github.io/VBRRVQ.github.io/</a></p>]]></content:encoded></item><item><title><![CDATA[Muyan-TTS: Open-Source LLM-Based TTS Optimized for Podcasting—Built on a $50K Budget ]]></title><description><![CDATA[The Text-to-Speech (TTS) landscape is undergoing a transformation&#8212;powered by Large Language Models (LLMs) that significantly improve semantic understanding and speech naturalness.]]></description><link>https://sankar1535.substack.com/p/muyan-tts-open-source-llm-based-tts</link><guid isPermaLink="false">https://sankar1535.substack.com/p/muyan-tts-open-source-llm-based-tts</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Wed, 30 Apr 2025 13:30:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!u1Lo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Text-to-Speech (TTS) landscape is undergoing a transformation&#8212;powered by Large Language Models (LLMs) that significantly improve semantic understanding and speech naturalness. 
But despite these advancements, most LLM-based TTS systems remain closed-source, resource-intensive, or ill-suited for real-world audio applications like podcasting.</p><p>Introducing <strong>Muyan-TTS</strong>, a fully open-source, trainable TTS system <strong>purpose-built for podcast scenarios</strong>&#8212;achieving <strong>high naturalness, zero-shot synthesis, and fast inference</strong> within a <strong>$50,000 total training budget</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u1Lo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u1Lo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 424w, https://substackcdn.com/image/fetch/$s_!u1Lo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 848w, https://substackcdn.com/image/fetch/$s_!u1Lo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!u1Lo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!u1Lo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png" width="1438" height="1008" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1008,&quot;width&quot;:1438,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u1Lo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 424w, https://substackcdn.com/image/fetch/$s_!u1Lo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 848w, https://substackcdn.com/image/fetch/$s_!u1Lo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!u1Lo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8940eda0-9c0c-46e3-96be-3cdef827ff83_1438x1008.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>&#128269; <strong>Key Innovations:</strong></p><ul><li><p>&#9989; <strong>LLM-Based Synthesis</strong>: Uses a pre-trained <strong>LLaMA-3.2-3B</strong> as a replacement for traditional AR models&#8212;bridging the gap between LLM expressiveness and TTS fidelity.</p></li><li><p>&#9989; <strong>Parallel Audio-Text Tokenization</strong>: Text is tokenized with the LLM&#8217;s tokenizer, while audio is quantized using GPT-SoVITS audio tokens, enabling alignment of speech and text representations.</p></li><li><p>&#9989; <strong>VITS-Based Decoder</strong>: Fine-tuned for podcast audio to mitigate hallucinations and enhance pronunciation via a structured G2P module.</p></li></ul><p>&#9881;&#65039; <strong>System Overview:</strong></p><ol><li><p><strong>&#128229; Data 
Pipeline:</strong> 150K+ hours of multilingual raw audio collected and filtered down to 100K+ high-quality hours using Whisper, FunASR, and MOS/NISQA thresholds. Audio cleaning used Music Source Separation, DeEcho, DeReverb, and more. (~$30K in GPU compute)</p></li><li><p><strong>&#129504; LLM Pre-training:</strong> 15 epochs on 80 A100 GPUs&#8212;training LLaMA-3.2-3B on audio-text pairs via 1024 learned audio tokens. (~$19.2K)</p></li><li><p><strong>&#127919; SFT (Speaker Adaptation):</strong> Efficient post-training using a few minutes of target speaker data (~15 mins/GPU node per hour of speech).</p></li><li><p><strong>&#128266; Decoder Training:</strong> Fine-tuned SoVITS on a curated 10K-hour MOS&gt;4.5 podcast set. (~$1.34K)<br></p></li></ol><p>&#128202; <strong>Evaluation Highlights:</strong></p><ul><li><p><strong>&#128200; WER:</strong> 3.44% (LibriSpeech), 4.09% (SEED) &#8212; outperforming Spark-TTS, FireRedTTS, and GPT-SoVITS.</p></li><li><p><strong>&#128483;&#65039; MOS:</strong> 4.58 (LibriSpeech), 4.32 (SEED) &#8212; strong perceptual quality, especially after SFT (4.97 MOS).</p></li><li><p><strong>&#128101; Speaker Similarity:</strong> Improved from 0.37 &#8594; 0.46 (SIM) post-SFT.</p></li><li><p><strong>&#9889; Inference Speed:</strong> Fastest among all tested models (Step-Audio, CosyVoice2, Spark-TTS, FireRedTTS, and GPT-SoVITS v3), with a real-time factor of 0.33.</p><p></p></li></ul><p>&#127919; <strong>Why It Matters:</strong></p><ul><li><p>First open-source LLM-TTS stack optimized end-to-end for <strong>long-form podcast content</strong>.</p></li><li><p>Fully reproducible training pipeline: data collection, LLM training, decoding, and inference&#8212;all open.</p></li><li><p>Achieves <strong>production-grade synthesis</strong> and <strong>real-time capabilities</strong> on <strong>modest compute</strong>&#8212;busting the myth that LLM-based TTS is out of reach for small labs and indie developers.</p></li></ul><p>&#128736;&#65039; Explore the code,
pre-trained models, and docs:<a href="https://github.com/MYZY-AI/Muyan-TTS"> https://github.com/MYZY-AI/Muyan-TTS</a></p><p></p>]]></content:encoded></item><item><title><![CDATA[A New Era in Audio AI: Introducing Kimi-Audio]]></title><description><![CDATA[The shift in audio AI is underway &#8212; from narrow, task-specific systems to unified, foundation-level architectures.]]></description><link>https://sankar1535.substack.com/p/a-new-era-in-audio-ai-introducing</link><guid isPermaLink="false">https://sankar1535.substack.com/p/a-new-era-in-audio-ai-introducing</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Tue, 29 Apr 2025 11:23:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A3Cr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c7c370-5686-4037-8199-e22be36aaccc_1600x1208.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The shift in audio AI is underway &#8212; from narrow, task-specific systems to unified, foundation-level architectures. 
Just like NLP had its LLM moment, audio is entering its own revolution.</p><p>Kimi-Audio is a major leap: an <strong>open-source, real-time, instruction-following foundation model</strong> for audio <strong>understanding, generation, and conversation</strong> &#8212; all within one architecture.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!A3Cr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9c7c370-5686-4037-8199-e22be36aaccc_1600x1208.png" width="1456" height="1099" alt=""></figure></div><p><strong>&#129504; What makes it different?</strong></p><p>Kimi-Audio uses a hybrid input representation to bridge text and audio:<br> &#8226; Discrete semantic tokens (12.5Hz) from an ASR-based tokenizer<br> &#8226; Continuous acoustic vectors from Whisper, also downsampled to 12.5Hz</p><p>These are fed into a shared transformer initialized from a large language model, with:<br> &#8226; Dual heads for generating both text and discrete audio tokens<br> &#8226; A streaming detokenizer for real-time synthesis using flow matching + BigVGAN<br> &#8226; A novel look-ahead mechanism that smooths audio at chunk boundaries</p><p><strong>The result</strong>: low-latency, high-quality, multimodal generation from a single, unified system.</p><p><strong>&#128640; Why it matters</strong></p><p>Kimi-Audio changes how we build audio systems:<br> &#8226; Unified architecture replaces separate ASR, TTS, and sound models<br> &#8226; Real-time, instruction-following capabilities open new 
application domains<br> &#8226; Designed to support diverse data: speech, sound, music, and text<br> &#8226; Open-source release promotes transparency, reproducibility, and extensibility</p><p>This model is part of a broader vision for <strong>general-purpose audio intelligence</strong> &#8212; where models can hear, understand, and speak in natural, context-aware ways.</p><p><strong>&#128736;&#65039; Built for the community</strong></p><p>Kimi-Audio was pre-trained on 13M+ hours of audio and fine-tuned on 300K hours of labeled tasks. It comes with:<br> &#8226; Full codebase<br> &#8226; Released model checkpoints<br> &#8226; Evaluation toolkit</p><p>&#128279; Explore:<a href="https://github.com/MoonshotAI/Kimi-Audio"> https://github.com/MoonshotAI/Kimi-Audio</a></p>]]></content:encoded></item><item><title><![CDATA[Synergistic Speech Synthesis: Bridging Modalities for Enhanced LLM-driven Text-to-Speech]]></title><description><![CDATA[The landscape of text-to-speech synthesis has been significantly transformed by the integration of large language models through discrete tokenization, enabling remarkable control over various vocal attributes via speech prompt conditioning.]]></description><link>https://sankar1535.substack.com/p/synergistic-speech-synthesis-bridging</link><guid isPermaLink="false">https://sankar1535.substack.com/p/synergistic-speech-synthesis-bridging</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Mon, 21 Apr 2025 11:47:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QsaL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95397805-0fa7-4d42-88de-761b38b4ddfa_1600x1094.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The landscape of text-to-speech synthesis has been significantly transformed by the integration of large language models through discrete tokenization, enabling remarkable control over various vocal attributes via speech 
prompt conditioning. Yet, does the necessity for discrete speech tokenization inherently compromise the nuanced acoustic fidelity and real-world applicability of these advanced architectures?</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!QsaL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95397805-0fa7-4d42-88de-761b38b4ddfa_1600x1094.png" width="1456" height="996" alt=""></figure></div><p>This work introduces GOAT-TTS, a novel dual-branch architecture designed to optimise LLM-based text-to-speech generation. The framework uniquely incorporates a modality-alignment branch that captures continuous acoustic embeddings, fostering a bidirectional understanding between paralinguistic features and semantic text representations without strict transcript alignment. Complementing this, the speech-generation branch employs modular fine-tuning on the LLM's top layers, preserving its native linguistic comprehension while optimising for speech token prediction, further enhanced by a multi-token prediction mechanism for real-time streaming.</p><p>This dual-branch approach directly tackles the irreversible loss of acoustic details associated with speech prompt quantization and circumvents the limitations imposed by the requirement for precisely aligned speech-text pairs. 
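</p><p>The layer-wise freezing idea behind the speech-generation branch can be sketched in a few lines of Python. The layer names, the top-k value, and the "speech_head" module below are illustrative assumptions, not GOAT-TTS's actual configuration.</p>

```python
def trainable_flags(layer_names, top_k=4, extra_modules=("speech_head",)):
    """Freeze the lower LLM layers and train only the top_k layers plus any
    newly added speech modules (a layer-wise freezing strategy)."""
    flags = {name: i >= len(layer_names) - top_k
             for i, name in enumerate(layer_names)}
    for name in extra_modules:  # the new speech-token head is always trained
        flags[name] = True
    return flags

# A 12-layer decoder: only layers 8-11 and the speech head remain trainable.
layers = [f"decoder.layer.{i}" for i in range(12)]
flags = trainable_flags(layers)
```

<p>In a real PyTorch setup the same dictionary would drive <code>requires_grad</code> on each module's parameters, which is how the lower layers keep the LLM's text comprehension intact while only the top layers adapt to speech token prediction.</p><p>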
By using continuous acoustic embeddings, GOAT-TTS maintains rich paralinguistic information, and the transcript-free modality alignment significantly broadens its deployment flexibility. Furthermore, the selective fine-tuning strategy mitigates the catastrophic forgetting of the LLM's original text comprehension capabilities during the adaptation for speech token generation.</p><p>The training involved a two-stage process, starting with modality-alignment training using speech-text continuation pairs derived from ASR corpora and TTS-generated speech, followed by speech-generation training on approximately 150k hours of real-world conversational data, employing a layer-wise parameter freezing strategy and multi-token prediction.</p><p>Experimental evaluations on standard datasets demonstrate that this approach achieves performance comparable to state-of-the-art TTS models in non-streaming settings and significantly outperforms them in streaming scenarios, while also proving effective in generating high-quality dialect-specific speech data capable of enhancing dialect ASR model performance.</p><p>This innovation suggests a promising trajectory towards more robust and versatile LLM-driven speech synthesis systems capable of high-fidelity output and seamless integration into real-world applications, particularly those involving diverse linguistic and acoustic environments.</p><p>More Details: <a href="https://arxiv.org/html/2504.12339v1">Paper</a></p>]]></content:encoded></item><item><title><![CDATA[Tokens with Meaning: Redefining Audio Language Modeling through Semantic Compression]]></title><description><![CDATA[When bitrates shrink but understanding deepens, is compression still a compromise?]]></description><link>https://sankar1535.substack.com/p/tokens-with-meaning-redefining-audio</link><guid isPermaLink="false">https://sankar1535.substack.com/p/tokens-with-meaning-redefining-audio</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Fri, 18 Apr 
2025 07:39:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ma0q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abe031d-a9ff-4aec-99b8-7b56ad96c67c_1600x665.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The recent surge in audio language models highlights the crucial role of transforming raw audio into discrete tokens, enabling the application of large language model architectures to the auditory world. However, existing audio tokenisers often process individual audio segments independently, potentially overlooking broader contextual cues across time.</p><p>Could we achieve richer semantic representations and more efficient compression by explicitly modelling the inter-frame relationships in audio, rather than treating each segment in isolation? This question arises from the limitations of frame-based encoding prevalent in prior methods like Encodec.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ma0q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2abe031d-a9ff-4aec-99b8-7b56ad96c67c_1600x665.png" width="1456" height="605" alt=""></figure></div><p>The core innovation lies in a novel <strong>query-based compression strategy</strong>, where a set of learnable query tokens 
interacts with audio frames via transformer layers to capture holistic, cross-frame context. This approach contrasts with frame-by-frame encoding, allowing the model to glean more semantic information and represent audio with fewer tokens.</p><p>This innovation directly tackles the problem of information redundancy and the lack of explicit context modelling in existing low-bitrate audio codecs. By using query tokens, the model can focus on extracting the most salient information across the entire audio sequence, leading to more efficient compression and semantically richer tokens.</p><p>One potential drawback is that the introduction of more complex mechanisms like the query-based attention might lead to increased computational demands during training and inference compared to simpler frame-based methods, although the paper argues for lower bitrate and comparable efficiency.</p><p>The ALMTokenizer was trained on approximately 4,500 hours of diverse audio data, encompassing speech (LibriTTS, MLS), sound (AudioSet), and music (Million Song Dataset). The model employs transformer networks with up to 24 self-attention layers and is trained for 200k steps using AdamW with a learning rate of 1e-4, across 4 NVIDIA A100-80G GPUs.</p><p>ALMTokenizer achieves competitive reconstruction at a lower bitrate than SOTA tokenizers like Encodec and MimiCodec, while outperforming them in downstream tasks like TTS, ASR, and audio captioning. 
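</p><p>As a toy illustration of the query-based compression idea described above, the cross-attention pooling step can be sketched with NumPy. The query count, dimensions, and single-window setup are assumptions for illustration, not ALMTokenizer's actual design.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def query_pool(frames, queries):
    """Cross-attention pooling: a fixed set of query tokens attends over all
    audio frames, so the output token count is set by the number of queries,
    not by the clip length."""
    d = queries.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)           # (num_queries, T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over frames
    return weights @ frames                            # (num_queries, d)

queries = rng.normal(size=(8, 32))   # 8 learned query tokens, dim 32
short = rng.normal(size=(50, 32))    # 50 audio frames
long = rng.normal(size=(500, 32))    # 500 audio frames

tokens_short = query_pool(short, queries)
tokens_long = query_pool(long, queries)
```

<p>The point of the sketch: both clips compress to the same number of context-aware tokens, which is how this strategy trades frame-by-frame redundancy for fewer, semantically richer tokens.</p><p>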
MUSHRA tests also confirm strong subjective quality across speech, music, and sound.</p><p>This work points to a promising future for audio language models, showing that context-aware tokenization enables better compression with richer semantics&#8212;paving the way for more efficient and powerful multimodal AI systems.</p><p>More details: <a href="https://arxiv.org/html/2504.10344v1">paper</a></p>]]></content:encoded></item><item><title><![CDATA[Real-Time Audio's Neural Frontier: Bridging Quality and Latency]]></title><description><![CDATA[Recent strides in neural audio codecs have shown remarkable promise in discretizing audio signals with minimal bit usage while maintaining high fidelity, playing a crucial role in areas like real-time communication and speech language models.]]></description><link>https://sankar1535.substack.com/p/real-time-audios-neural-frontier</link><guid isPermaLink="false">https://sankar1535.substack.com/p/real-time-audios-neural-frontier</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Wed, 16 Apr 2025 05:41:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eY0E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bd6db-8093-445f-85e1-e1d4ee075e14_1600x436.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recent strides in neural audio codecs have shown remarkable promise in discretizing audio signals with minimal bit usage while maintaining high fidelity, playing a crucial role in areas like real-time communication and speech language models. 
Yet, the prevalence of non-causal structures in these advanced codecs often leads to excessively high latency, hindering their applicability in scenarios demanding immediate interaction.</p><p>Can we truly achieve top-tier audio quality with neural codecs without sacrificing the crucial element of real-time responsiveness demanded by interactive applications?</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!eY0E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bd6db-8093-445f-85e1-e1d4ee075e14_1600x436.png" width="1456" height="397" alt=""></figure></div><p>The core innovation lies in the introduction of <strong>StreamCodec</strong>, a fully causal and symmetric encoder-decoder architecture operating in the MDCT domain, coupled with a novel <strong>residual scalar-vector quantizer (RSVQ)</strong>. This RSVQ sequentially employs a scalar quantizer for coarse audio contours and an improved vector quantizer to refine acoustic details in a residual manner.</p><p>This design directly tackles the trade-off between coding quality and latency inherent in streamable neural audio codecs by ensuring fully causal operation for low latency and using RSVQ to compensate for any quality loss arising from this structural constraint. 
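</p><p>The residual scalar-then-vector composition can be sketched with toy NumPy code; the step size, codebook size, and latent shapes below are illustrative assumptions rather than StreamCodec's actual parameters.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def scalar_quantize(x, step=0.5):
    # Stage 1: capture the coarse contour on a uniform scalar grid.
    return np.round(x / step) * step

def vector_quantize(residual, codebook):
    # Stage 2: map each residual vector to its nearest codebook entry.
    dists = np.linalg.norm(residual[:, None, :] - codebook[None, :, :], axis=-1)
    return codebook[np.argmin(dists, axis=1)]

frames = rng.normal(size=(16, 8))               # 16 latent frames, dim 8
codebook = rng.normal(scale=0.1, size=(64, 8))  # toy VQ codebook

coarse = scalar_quantize(frames)                # coarse audio contour
refined = coarse + vector_quantize(frames - coarse, codebook)
```

<p>Because the stages are sequential, a decoder can reconstruct from the coarse scalar codes alone and add the vector-quantized residual as a refinement, which is the general property residual quantizers exploit.</p><p>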
While the paper does not explicitly mention specific disadvantages of this innovation, the complexity of the RSVQ might introduce some computational overhead, although the results suggest high efficiency.</p><p>Experiments were conducted on the 16 kHz LibriTTS (approximately 263 hours) and 48 kHz VCTK (approximately 43 hours) datasets. StreamCodec achieved a ViSQOL score of 4.30 at 1.5 kbps on LibriTTS and demonstrated nearly 20&#215; real-time generation speed on a CPU with a lightweight 7M-parameter model.</p><p>The development of StreamCodec marks a significant step towards high-fidelity, low-latency neural audio compression, paving the way for enhanced real-time communication applications and potentially influencing the design of future streamable audio processing systems.</p><p>More details: <a href="https://lnkd.in/dFz2zACN">https://lnkd.in/dFz2zACN</a></p>]]></content:encoded></item><item><title><![CDATA[Decoupled Training: Pioneering Offline Quantization for Neural Audio Compression]]></title><description><![CDATA[Neural audio codecs, particularly those employing neural networks to compress waveforms into discrete tokens, have become vital for recent advancements in audio generative models, with many state-of-the-art methods relying on the end-to-end training of autoencoders with a quantization bottleneck.]]></description><link>https://sankar1535.substack.com/p/decoupled-training-pioneering-offline</link><guid isPermaLink="false">https://sankar1535.substack.com/p/decoupled-training-pioneering-offline</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Fri, 11 Apr 2025 08:01:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gh3X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Neural audio codecs, particularly those employing neural 
networks to compress waveforms into discrete tokens, have become vital for recent advancements in audio generative models, with many state-of-the-art methods relying on the end-to-end training of autoencoders with a quantization bottleneck.</p><p>But if the joint training of autoencoders and quantizers has been the dominant paradigm, does this fundamentally limit the exploration of more powerful and flexible quantization techniques developed independently?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gh3X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gh3X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 424w, https://substackcdn.com/image/fetch/$s_!gh3X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 848w, https://substackcdn.com/image/fetch/$s_!gh3X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 1272w, https://substackcdn.com/image/fetch/$s_!gh3X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gh3X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png" width="494" height="451.78752886836025" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:866,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:166053,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/161009187?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gh3X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 424w, https://substackcdn.com/image/fetch/$s_!gh3X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 848w, https://substackcdn.com/image/fetch/$s_!gh3X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gh3X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b9f65c0-ef9e-40ae-a399-874eddeb6ad9_866x792.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The core innovation lies in a <strong>three-stage strategy termed QINCODEC</strong>, which involves pre-training a continuous compression autoencoder, followed by an offline quantization of the latent representations using a state-of-the-art trainable quantizer like QINCO2, and an optional final finetuning of the decoder.</p><p>This decoupling of training
addresses the limitations of end-to-end approaches by allowing the use of any off-the-shelf quantizer, especially those with complex training procedures unsuitable for online updates, and simplifies the training process significantly.</p><p>A potential trade-off of this approach is that the offline quantization is inherently limited by the quality of the pre-trained latent representations and the bitrate, which can impact performance, particularly at lower bitrates.</p><p>The compression models were trained on 1-second, 44.1kHz audio clips from the WavCaps dataset, which contains around 400k general sounds, using 8 A100 GPUs for 1 million steps with an effective batch size of 240.</p><p>QINCODEC achieved <strong>competitive results at 8 kbps and outperformed state-of-the-art methods like DAC and ENCODEC at 16 kbps</strong> in various objective and subjective metrics, demonstrating the viability of non-end-to-end training.</p><p>This work suggests a future where audio codec design can be more modular and adaptable, amortising the cost of autoencoder pre-training and enabling the integration of advanced offline quantization techniques to potentially improve compression efficiency and quality for audio generative models.<br><br>More details: <a href="https://arxiv.org/abs/2503.19597">paper</a></p>]]></content:encoded></item><item><title><![CDATA[The Simplicity of Sound: A Leaner Approach to Audio Compression]]></title><description><![CDATA[Recent advancements in neural audio codecs have largely focused on achieving superior audio quality through increasingly intricate architectures that often employ multiple quantizers for encoding.]]></description><link>https://sankar1535.substack.com/p/the-simplicity-of-sound-a-leaner</link><guid isPermaLink="false">https://sankar1535.substack.com/p/the-simplicity-of-sound-a-leaner</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Thu, 10 Apr 2025 09:23:05 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!zdsB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zdsB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zdsB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 424w, https://substackcdn.com/image/fetch/$s_!zdsB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 848w, https://substackcdn.com/image/fetch/$s_!zdsB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 1272w, https://substackcdn.com/image/fetch/$s_!zdsB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zdsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png" width="1456" height="736" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://sankar1535.substack.com/i/161004729?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zdsB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 424w, https://substackcdn.com/image/fetch/$s_!zdsB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 848w, https://substackcdn.com/image/fetch/$s_!zdsB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 1272w, https://substackcdn.com/image/fetch/$s_!zdsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F248149a0-cd17-44e0-b947-565dc1e80726_1464x740.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Recent advancements in neural audio codecs have largely focused on achieving superior audio quality through increasingly intricate architectures that often employ multiple quantizers for encoding. But in the relentless pursuit of fidelity, have we overlooked the practical limitations imposed by the sheer computational cost and complexity of these multi-stage quantization processes?</p><p>This research introduces <strong>SQCodec, a novel neural audio codec that pioneers a return to elegance by demonstrating that high-fidelity audio compression can be achieved with a significantly more lightweight architecture relying on a single quantizer</strong>.
Its core innovation lies in a carefully engineered design featuring streamlined convolutional networks, local Transformer modules, and the introduction of TConv, a mechanism specifically designed to capture both short- and long-term acoustic variations efficiently.</p><p>By adopting a single quantizer and a lightweight structure, this innovation directly tackles the inherent challenges associated with complex multi-quantizer systems, such as their considerable computational and memory demands, which hinder real-world scalability. Furthermore, it simplifies the tokenisation process, avoiding the hierarchical token streams that complicate integration with downstream generative models and require specialised aggregation operations.</p><p>While the move to a single quantizer inherently presents a risk of reduced codebook capacity, potentially impacting reconstruction accuracy, this work mitigates that risk through the implementation of Finite Scalar Quantization (FSQ), which supports large codebooks without the typical collapse issues. It's also worth noting that at very low bitrates, the Mel Spectrogram Distance (MEL) of this codec may be slightly higher than that of codecs that explicitly optimise for it, as the training prioritises STFT distance.</p><p>The model was trained on a diverse set of audio data, including speech from LibriSpeech and Common Voice, music from MTG-Jamendo, and general sounds from FSD50K, showcasing its ability to learn across different acoustic domains. This training was efficiently conducted on a single NVIDIA RTX 4090 GPU, highlighting the model's reduced computational demands.</p><p>Crucially, experimental results demonstrate that <strong>SQCodec achieves audio quality comparable to state-of-the-art multi-quantizer codecs like DAC, while requiring an order of magnitude fewer parameters and significantly less computational overhead</strong>.
At ultra-low bitrates, it often surpasses the performance of other single-quantizer models like WavTokenizer across various objective metrics.</p><p>This work paves the way for a future where high-fidelity neural audio compression is more accessible and integrable, particularly in resource-constrained scenarios and for applications like multimodal large language models and real-time communication. The success of this lightweight single-quantizer design opens exciting new avenues for optimising efficient audio processing and hardware-aware implementations.</p><p>More details: <a href="https://arxiv.org/abs/2504.04949">Paper</a></p>]]></content:encoded></item><item><title><![CDATA[Advancing Text-to-Speech: Acoustic Models and Vocoders in Focus]]></title><description><![CDATA[For the technically curious: let&#8217;s dissect the nuts and bolts of TTS pipelines&#8212;a domain where innovation is turning text into hyper-realistic speech at scale.]]></description><link>https://sankar1535.substack.com/p/advancing-text-to-speech-acoustic</link><guid isPermaLink="false">https://sankar1535.substack.com/p/advancing-text-to-speech-acoustic</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Thu, 10 Apr 2025 07:05:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kDwB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kDwB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!kDwB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 424w, https://substackcdn.com/image/fetch/$s_!kDwB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 848w, https://substackcdn.com/image/fetch/$s_!kDwB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 1272w, https://substackcdn.com/image/fetch/$s_!kDwB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kDwB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png" width="1456" height="519" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!kDwB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 424w, https://substackcdn.com/image/fetch/$s_!kDwB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 848w, https://substackcdn.com/image/fetch/$s_!kDwB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 1272w, https://substackcdn.com/image/fetch/$s_!kDwB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bd4fcab-abe1-47be-9fbd-62eb962c8f8d_1600x570.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>For the technically curious: let&#8217;s dissect the <strong>nuts and bolts of TTS pipelines</strong>&#8212;a domain where innovation is turning text into hyper-realistic speech at scale.</p><blockquote><p><strong>&#128273; Acoustic Models</strong></p></blockquote><p>At the heart of TTS, acoustic models predict features (e.g., mel-spectrograms) that capture the nuances of human speech:</p><ul><li><p><strong>RNN-based Models</strong>: Early stars like Tacotron 2 combined sequential processing and attention mechanisms to align text with speech, but autoregressive decoding limits their speed.</p></li><li><p><strong>CNN-based Models</strong>: Enter Deep Voice and ParaNet, with parallel processing and faster inference by capturing dependencies across entire sequences simultaneously.</p></li><li><p><strong>Transformer-based Models</strong>: TransformerTTS and FastSpeech revolutionized TTS by leveraging self-attention for global context modeling, enabling long-term prosody and rhythm. FastSpeech 2 even integrated pitch, energy, and duration predictors for enhanced expressiveness.</p></li><li><p><strong>LLM-based Approaches</strong>: Cutting-edge models like PromptTTS and InstructTTS bridge natural language and speech synthesis, offering fine-grained control over tone, emotion, and prosody using text-based prompts.</p></li></ul><blockquote><p><strong>&#128266; Speech Vocoders</strong></p></blockquote><p>Turning acoustic features into waveforms that sound natural is no small feat.
Here&#8217;s what&#8217;s shaping this critical step:</p><ul><li><p><strong>GAN-based Vocoders</strong>: HiFi-GAN and Parallel WaveGAN leverage adversarial training to generate natural speech efficiently while maintaining frequency and time-domain coherence.</p></li><li><p><strong>Diffusion-based Vocoders</strong>: WaveGrad and DiffWave are redefining quality by iteratively refining waveforms from noise, though they challenge real-time applications with computational intensity.</p></li><li><p><strong>RNN/CNN Vocoders</strong>: Models like WaveRNN and Parallel WaveNet balance speed and quality, introducing multi-band strategies and non-autoregressive techniques for faster synthesis.</p></li></ul><blockquote><p><strong>&#128202; Representation Matters</strong></p></blockquote><ul><li><p><strong>Continuous Features</strong> (e.g., mel-spectrograms): Offer rich detail for expressiveness and prosody control but demand more compute.</p></li><li><p><strong>Discrete Tokens</strong> (e.g., quantized units): Compact and efficient, perfect for LLM-based and zero-shot TTS, but sometimes lose nuanced speech details.</p></li></ul><p>Whether you're optimizing <strong>FastSpeech for real-time synthesis</strong>, leveraging <strong>HiFi-GAN for natural voices</strong>, or exploring <strong>end-to-end systems like VITS</strong>, the choices in acoustic modeling and vocoders determine your system&#8217;s balance between <strong>quality, efficiency, and flexibility</strong>.</p><p>&#128161; Want to learn more about this emerging space? Explore the details here:<br> &#128073;<a href="https://arxiv.org/abs/2412.06602"> </a><strong><a href="https://arxiv.org/abs/2412.06602">https://arxiv.org/abs/2412.06602</a></strong></p><p>&#128161; <strong>What&#8217;s next?</strong> Diffusion-based real-time vocoders? Transformer-vocoder hybrids? 
Share your ideas or challenges&#8212;let&#8217;s push TTS further!</p>]]></content:encoded></item><item><title><![CDATA[Revolutionizing Audio Generation: Introducing X-Codec]]></title><description><![CDATA[The landscape of audio generation has been evolving rapidly, thanks to advancements in Large Language Models (LLMs).]]></description><link>https://sankar1535.substack.com/p/revolutionizing-audio-generation</link><guid isPermaLink="false">https://sankar1535.substack.com/p/revolutionizing-audio-generation</guid><dc:creator><![CDATA[Sankar Mukherjee]]></dc:creator><pubDate>Thu, 10 Apr 2025 06:51:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jYFW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jYFW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jYFW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 424w, https://substackcdn.com/image/fetch/$s_!jYFW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 848w, https://substackcdn.com/image/fetch/$s_!jYFW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 
1272w, https://substackcdn.com/image/fetch/$s_!jYFW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jYFW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png" width="960" height="535" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:535,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jYFW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 424w, https://substackcdn.com/image/fetch/$s_!jYFW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 848w, https://substackcdn.com/image/fetch/$s_!jYFW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jYFW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72660d51-5ea1-4e91-b21c-11b1e19eb49f_960x535.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The landscape of audio generation has been evolving rapidly, thanks to advancements in Large Language Models (LLMs).
However, there's been a persistent challenge: <strong>existing audio codecs, like EnCodec and HuBERT, fall short in maintaining semantic and acoustic integrity when used for audio LLM tasks</strong>.</p><p>Let's dive into the key pain points:<br> &#128313; <strong>Acoustic Codecs</strong> (e.g., EnCodec): While great for compressing audio, these struggle with "low-level fluctuations" in audio data, forcing LLMs to rely on massive datasets for training and causing slow convergence.<br> &#128313; <strong>Semantic Codecs</strong> (e.g., HuBERT): These excel at high-level semantic representation but ignore critical acoustic details, leading to two-stage frameworks for generation (semantic first, acoustic later).<br> &#128313; <strong>Hybrid Attempts</strong>: Models like SpeechTokenizer try to combine semantic and acoustic features but rely on complex two-stage AR+NAR frameworks.</p><div><hr></div><p><strong>&#127897;&#65039; Introducing X-Codec: Redefining Audio Encoding for LLMs &#128640;</strong></p><p>Enter <strong>X-Codec</strong>, a groundbreaking approach to audio encoding that integrates <strong>semantic information directly into the codec</strong>. Here's how it&#8217;s transforming audio generation:</p><p>&#128313; <strong>"X-Shaped" Architecture</strong>:<br> X-Codec combines <strong>semantic and acoustic features</strong> in a unified Residual Vector Quantization (RVQ) structure. This ensures tokens capture both semantic richness and acoustic fidelity, enabling accurate and streamlined audio generation.</p><p>&#128313; <strong>Semantic Reconstruction Loss</strong>:<br> By introducing a loss function after the RVQ stage, X-Codec guarantees that semantic integrity is preserved throughout the generation process.</p><p>&#128313; <strong>Improved Performance Across Tasks</strong>:<br> X-Codec drastically reduces WER in text-to-speech tasks while excelling in music continuation and text-to-sound applications. 
Its uniform tokenization ensures compatibility with existing LLMs without requiring structural changes.</p><p>&#128313; <strong>Enhanced Phonetic Discriminability</strong>:<br> Better phonetic precision compared to traditional acoustic codecs, especially with increased quantization layers.</p><p>&#128313; <strong>Versatility &amp; Compatibility</strong>:<br> Whether it's speech, music, or sound generation, X-Codec&#8217;s semantic-aware tokens deliver superior results without disrupting existing workflows.</p><p>&#128196; Read the full paper: <a href="https://arxiv.org/html/2408.17175v3">Paper</a></p><p>X-Codec is paving the way for more accurate, efficient, and versatile audio generation. What potential applications do you see for this innovation in your domain?</p>]]></content:encoded></item></channel></rss>