A discussion with Claude Sonnet 4.5 on designing a TTS model with voice cloning - February 10, 2026

USER: Hi Claude, I want to design and train a small text-to-speech (TTS) model but I haven't done this before and I could use some guidance. I have training data which includes short wav audio files along with text transcripts, and it has examples from multiple different speakers. Each speaker has a unique ID, an age, and a gender. I'd like to support voice cloning using only a short voice sample and maybe also the transcript. So the TTS model needs some type of input which gives it information about the speaker. How do you think I should approach this problem?

CLAUDE: Great project! Building a multi-speaker TTS system with voice cloning is definitely achievable. Here's how I'd approach this:

Core Architecture Approach

For multi-speaker TTS with cloning capability, you'll want to use speaker embeddings as conditioning inputs. This is the standard approach and works well for your use case. The idea is to encode speaker characteristics into a compact vector representation that guides the model's voice generation.

Two Main Architectural Options

1. Speaker ID Embeddings (simpler, good baseline)
* Create a learned embedding lookup table indexed by speaker ID
* During training, each speaker ID maps to a trainable vector
* For voice cloning at inference, you'd train a separate speaker encoder that maps audio samples to the same embedding space
* This works well when you have enough data per speaker

2. Speaker Encoder from the Start (better for cloning)
* Train a speaker encoder network alongside your TTS model
* The encoder takes a short audio clip and outputs a speaker embedding vector
* Use these embeddings to condition your TTS model
* At inference, just encode the reference audio—no need for speaker IDs
* This is what models like Tacotron 2 with speaker encoding or systems like Real-Time Voice Cloning do

I'd recommend option 2 since voice cloning is a priority for you. Here's why: the speaker encoder learns to extract voice characteristics directly from audio, so at inference you just provide a short sample and it works immediately without additional training.

Recommended Model Architecture

A good starting point would be something like:
* Speaker Encoder: A network (often CNN or RNN-based) that processes mel-spectrograms from reference audio and outputs a fixed-size embedding (e.g., 256-dimensional)
* TTS Model: Use an established architecture like Tacotron 2, FastSpeech 2, or VITS (more modern)
* Conditioning: Concatenate or add the speaker embedding to the encoder outputs or decoder inputs at each timestep

Using the Metadata

The age and gender metadata could be useful too:
* You could encode these as additional embedding vectors and concatenate them with speaker embeddings
* Or use them as auxiliary training signals
* However, if your speaker encoder is good enough, it might implicitly capture these characteristics from the voice itself

Practical Recommendations

1. Start with a proven architecture: VITS is state-of-the-art and end-to-end, while Tacotron 2 is more established and easier to debug
2. Speaker encoder training: You might want to pre-train it using a speaker verification loss (like GE2E or prototypical networks) or train it jointly with your TTS
3. Data requirements: Multi-speaker models need decent amounts of data—ideally at least 10-20 speakers with several minutes each for good generalization
4. Voice cloning quality: Quality depends heavily on having diverse speakers in training to learn what aspects of voice are speaker-specific vs content-specific

Would you like me to dive deeper into any particular aspect, like the speaker encoder architecture, training strategies, or specific model recommendations?

USER: My goal is to use a speaker encoder, so I'd like to know more about speaker encoder training. It's unclear to me how a network can be trained to produce speaker embeddings from mel spectrograms. What is the expected output of the network and how do we compute the loss/error?

CLAUDE: Great question! This is the key insight for understanding speaker encoders. The trick is to train the network with a metric learning objective that forces embeddings from the same speaker to be similar and embeddings from different speakers to be different.

How It Works

The speaker encoder doesn't predict a specific "correct" embedding. Instead, it learns to map audio to an embedding space where distance has meaning:
* Audio from the same speaker → embeddings close together
* Audio from different speakers → embeddings far apart

Training Process

Input: Mel-spectrogram segments from your audio clips
Output: A fixed-size embedding vector (e.g., 256-D)
Loss: Computed based on relationships between embeddings in a batch

Common Loss Functions

1. GE2E Loss (Generalized End-to-End)
This is what Google used for their speaker verification system, and it's very popular:
* Sample multiple utterances per speaker in each batch
* For each embedding, compute:
  * Similarity to its own speaker's centroid (should be high)
  * Similarity to other speakers' centroids (should be low)
* The loss pushes same-speaker embeddings toward their centroid and away from other speakers' centroids

2. Contrastive Loss
* Take pairs of audio segments
* Positive pairs (same speaker): minimize distance between embeddings
* Negative pairs (different speakers): maximize distance (up to a margin)

3. Triplet Loss
* Sample triplets: (anchor, positive, negative)
* Anchor: reference audio from speaker A
* Positive: different audio from speaker A
* Negative: audio from speaker B
* Loss: ensure distance(anchor, positive) < distance(anchor, negative) by some margin

Practical Example (GE2E)

Here's the conceptual flow:

    Batch structure:
    - N speakers
    - M utterances per speaker
    - Total: N × M mel-spectrograms

    For each mel-spec:
    1. Pass through encoder → get embedding e_ij (speaker i, utterance j)
    2. Compute centroid for each speaker: c_i = mean of all embeddings from speaker i
    3. For embedding e_ij, compute:
       - Similarity to own centroid: sim(e_ij, c_i)
       - Similarity to other centroids: sim(e_ij, c_k) for k ≠ i
    4. Loss encourages:
       - High similarity to own centroid
       - Low similarity to other centroids

The similarity function is typically cosine similarity (dot product of normalized vectors), and you use a softmax-based loss similar to classification.

Why This Works for TTS

Once trained, the encoder has learned to extract "what makes a voice unique" into the embedding vector. When you:
1. Feed a short reference audio clip through the encoder
2. Get the embedding
3. Condition your TTS model on this embedding

The TTS model learns during its own training that this embedding contains speaker characteristics, so it generates speech matching that voice.
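To make the softmax-over-centroids idea above concrete, here's a minimal sketch of a GE2E-style loss, assuming PyTorch and a batch already arranged as N speakers × M utterances. It's a simplification: the real GE2E loss also excludes each utterance from its own centroid and applies a learned scale and bias to the similarities.

```python
# Simplified GE2E-style loss sketch (PyTorch assumed). Real GE2E also excludes
# each utterance from its own centroid and learns a scale/bias on similarities.
import torch
import torch.nn.functional as F

def ge2e_style_loss(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (N speakers, M utterances, D), L2-normalized."""
    N, M, D = embeddings.shape
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)   # (N, D)
    flat = embeddings.reshape(N * M, D)                       # (N*M, D)
    sims = flat @ centroids.T                                 # (N*M, N) cosine similarities
    # Each utterance's "class" is its own speaker's centroid.
    targets = torch.arange(N).repeat_interleave(M)            # (N*M,)
    return F.cross_entropy(sims, targets)

# Usage: emb = F.normalize(encoder(mel_batch).view(N, M, D), dim=-1)
#        loss = ge2e_style_loss(emb)
```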
Training Strategy

You can either:
* Pre-train the speaker encoder on a speaker verification task, then freeze it while training the TTS
* Joint training: train the speaker encoder and TTS model together (more complex but can be better)

Does this clarify how the speaker encoder training works? The key insight is that you're training it to create a useful embedding space through relationships between samples, not to predict specific target values.

USER: I see, but that approach seems to ignore the fact that some speakers are more similar than others; it just seems to separate each speaker regardless of similarity. I feel like there should be a more general approach which does a better job of training the network to capture as much information about the speaker as possible in the speaker embeddings.

This might be a bad idea, but I was thinking one possible approach might be to train an auto-encoder which takes a mel spectrogram (or some mean/centroid of the speaker samples) as the input, and the output of the network has to reproduce the input. The bottleneck layer would contain a compressed representation of the input which might work as a speaker embedding. Even if that doesn't work, having a mel spectrogram auto-encoder might be useful because the TTS model could be trained to output mel spectrogram embeddings which could then be decoded into the full-resolution spectrogram, which would then be converted back to audio.

Is that approach something that has been tried before? What type of output format do TTS models typically use?

CLAUDE: You're thinking along exactly the right lines! Your intuition about the limitations of speaker verification losses is spot-on, and your autoencoder idea has actually been explored in various forms.

Your Autoencoder Idea

The challenge with a straightforward mel-spectrogram autoencoder for speaker embeddings is that the bottleneck would need to encode both content (what's being said) and speaker characteristics (who's saying it). You'd need a way to disentangle these.

However, there are more sophisticated approaches that address this:

Variational Autoencoders (VAEs) are commonly used because:
* The latent space is encouraged to be smooth and continuous (better for interpolation)
* You can condition the decoder on text/phonemes, forcing the latent code to capture only speaker info
* This is used in some TTS architectures

Better approach for your use case: Train an autoencoder where:
* Input: mel-spectrogram
* Encoder produces: speaker embedding (bottleneck)
* Decoder is conditioned on: both the speaker embedding and the phoneme/text sequence
* This forces the speaker embedding to only capture speaker info, not content

Regarding TTS Output Formats

Yes, using compressed representations has absolutely been tried and is actually very common! Here are the main approaches:

1. Direct Mel-Spectrogram Prediction (Tacotron 2, FastSpeech)
* TTS outputs mel-spectrograms directly
* A separate vocoder (WaveGlow, HiFi-GAN, etc.) converts mel → audio
* Simple and works well

2. Compressed Latent Representations (Modern Approach)
* VITS (Variational Inference TTS): uses a VAE framework and operates in a latent space
* Neural codec models (SoundStream, Encodec): compress audio into discrete tokens, and the TTS predicts these tokens
* Your idea of TTS → compressed embedding → decoder is essentially this!
3. Discrete Token Approaches (Cutting Edge)
* VALL-E, SPEAR-TTS, and similar: use neural audio codecs
* Audio is quantized into discrete tokens (like text tokenization)
* TTS becomes sequence-to-sequence: text tokens → audio tokens
* The codec decoder reconstructs the waveform from tokens

Why Compressed Representations?

You're right that this has advantages:
* Efficiency: smaller sequence length to model
* Better generalization: removes unnecessary detail
* Implicit disentanglement: compression can separate content from speaker characteristics
* Faster training/inference: less data to process

Modern State-of-the-Art

Models like VITS combine several of your ideas:
* Variational autoencoder framework
* Learns compressed latent representations
* Speaker embeddings condition the generation
* End-to-end training (no separate vocoder needed)

Or neural codec approaches, where you:
* Pre-train (or reuse) an audio autoencoder like Encodec
* Train the TTS model to predict the compressed codes
* Let the codec decoder reconstruct high-quality audio

Recommendation for Your Project

Given your thinking, I'd suggest:
1. Start with the VITS architecture - it incorporates VAE principles and is state-of-the-art
2. Or try a neural codec approach: train/use a pre-trained audio codec (Encodec is open-source), then train your TTS to predict codec tokens conditioned on speaker embeddings

Both approaches align with your intuition about compressed representations and would work well with speaker embeddings. Your instinct about autoencoder-based compression is solid and reflects how modern TTS has evolved. The key innovations have been about how to structure that compression to separate speaker identity from linguistic content.

Want to discuss implementation details for either approach?

USER: Can you briefly explain the difference between VAEs and normal auto-encoders? I'm guessing they are a bit trickier to train? But you said a better approach might be a normal auto-encoder conditioned on both the speaker embedding and the transcript? I don't really understand why that forces the speaker embedding to only capture speaker info. Maybe we need to think more about what actually counts as useful speaker information so we have a better idea of what type of architecture is most suited for the problem.

CLAUDE: Great questions! Let me address these in order.

VAEs vs Regular Autoencoders

Regular Autoencoder:
* Encoder maps input → single embedding vector
* Decoder reconstructs from that vector
* No constraints on what the latent space looks like

VAE (Variational Autoencoder):
* Encoder outputs two things: mean (μ) and variance (σ²) of a distribution
* You sample from that distribution to get the latent vector
* An additional loss term (KL divergence) encourages the distribution to be close to a standard normal N(0, 1)
* This regularizes the latent space to be smooth and continuous

VAEs are a bit trickier to train (you need to balance the reconstruction loss and the KL loss), but the benefit is you get a well-structured latent space that's better for generation and interpolation.

Why Conditioning on Text Would Force Speaker-Only Info

You're right to question this—let me think through it more carefully. The idea would be:

    Input: mel-spectrogram of "hello world" from speaker A
    Encoder: produces speaker embedding
    Decoder: takes speaker embedding + text "hello world" → reconstructs mel-spectrogram

The theory: if the decoder has access to the text/content, the bottleneck embedding doesn't need to encode "what was said," only "who said it."
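To make that setup concrete, here's a minimal sketch of the text-conditioned autoencoder forward pass, assuming PyTorch. The module names, dimensions, and the simple concatenation-based conditioning are illustrative choices, not a specific published architecture.

```python
# Illustrative sketch of the text-conditioned autoencoder (PyTorch assumed).
# Module names, dimensions, and concatenation-based conditioning are
# hypothetical choices for clarity, not a specific published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel):                   # mel: (B, T_audio, n_mels)
        _, h = self.rnn(mel)                  # h: (1, B, emb_dim)
        return F.normalize(h[-1], dim=-1)     # (B, emb_dim)

class ConditionedDecoder(nn.Module):
    def __init__(self, n_phonemes=100, emb_dim=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.rnn = nn.GRU(2 * emb_dim, 512, batch_first=True)
        self.to_mel = nn.Linear(512, n_mels)

    def forward(self, phonemes, spk_emb):     # phonemes: (B, T_text), spk_emb: (B, emb_dim)
        x = self.phoneme_emb(phonemes)                        # (B, T_text, emb_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast to every timestep
        out, _ = self.rnn(torch.cat([x, spk], dim=-1))
        return self.to_mel(out)               # (B, T_text, n_mels) -- length handling omitted

# Reconstruction objective: the decoder sees the transcript, so in theory the
# bottleneck only needs to carry "who said it":
#   spk_emb = SpeakerEncoder()(mel)
#   mel_hat = ConditionedDecoder()(phonemes, spk_emb)
#   loss = F.l1_loss(mel_hat, mel_target)   # after aligning lengths (attention/duration model)
```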
But you've identified the flaw: this doesn't actually guarantee the embedding only contains speaker info! The embedding could still encode:
* Speaking rate, prosody, emotion
* Pronunciation quirks
* Content information (redundant with the text but still there)

So you're right—this doesn't cleanly separate speaker identity from everything else.

What Actually Counts as Useful Speaker Information?

This is the crux of the issue. Let's think about what varies in speech:

Speaker-specific (what we want):
* Voice timbre/quality (vocal tract shape, vocal fold characteristics)
* Pitch range (fundamental frequency)
* Formant frequencies

Not purely speaker-specific:
* Speaking rate (speaker + mood + context)
* Prosody/intonation (speaker + emotion + meaning)
* Accent/pronunciation (speaker + language background)
* Energy/loudness (speaker + emotion + context)

Content:
* Phonemes/words being spoken
* Linguistic meaning

For TTS with cloning, you probably want the embedding to capture timbre and pitch characteristics, but you want controllable prosody (so the TTS can generate natural-sounding speech for any text, not just mimic the reference clip).

Better Approaches

Given this nuance, here are more sophisticated options:

1. Multi-level representations
* Speaker embedding: timbre only
* Separate prosody/style embeddings: learned from the reference but controllable
* This is what some modern systems do (YourTTS, NaturalSpeech)

2. Disentanglement through multiple reference clips
* Train with multiple clips from the same speaker saying different things
* Force the encoder to extract what's consistent across clips (speaker identity)
* What varies across clips (content, prosody) can't be reliably encoded

3. Explicit disentanglement losses
* Use adversarial training or mutual information minimization
* Explicitly penalize the speaker embedding if it encodes content information
* More complex but more principled

4. Accept some entanglement
* Maybe it's okay if the embedding captures speaking style, not just pure timbre
* As long as the TTS can generalize to new text, some style capture might even be desirable

My Revised Recommendation

For a practical system, I'd suggest:

Option A (Simpler): Use the metric learning approach (GE2E/contrastive) but sample diverse utterances per speaker during training. The encoder will naturally learn to extract consistent speaker characteristics across different content.

Option B (More sophisticated): Train an encoder that takes multiple reference clips from the same speaker and outputs a single embedding. This architecturally enforces learning speaker-consistent features.

You're absolutely right that we need to think carefully about what information we want encoded. The "condition on text" approach isn't as clean as I initially suggested.

What do you think—does the multi-clip approach or the metric learning with diverse samples seem more aligned with your goals?

USER: I have many samples of the same speaker saying different things, so the multi-clip approach seems best. However, it's unclear to me how I would design the network. The decoder has access to the transcript, and the encoder takes in multiple reference clips at the same time, is that what you're suggesting? Would the samples/clips need to be categorized in some way? I would prefer to avoid that extra work. Some entanglement in the speaker embeddings probably is desirable because I want the voice cloning to match the general style of the sample used for cloning.

CLAUDE: Good thinking!
Let me clarify the multi-clip approach, because I think there's been some confusion about the training setup.

The Key Insight

The multi-clip idea isn't about the decoder having access to transcripts during autoencoder training. Rather, it's about how you structure the speaker encoder training to force it to extract speaker-consistent information.

Here's the approach:

Training the Speaker Encoder (Separate Step)

Input to encoder: multiple mel-spectrogram clips from the same speaker (different utterances)

Architecture options:

1. Pooling approach (simplest):
* Pass each clip independently through the same encoder network → get N embeddings
* Pool them (mean/max pooling) → single speaker embedding
* Loss: metric learning (GE2E/contrastive) comparing pooled embeddings across speakers

2. Attention-based:
* Encode each clip → sequence of embeddings
* Use attention or a learnable aggregation to combine them
* Output: single speaker embedding

No categorization needed! You just:
* Sample N clips randomly from speaker A
* Sample N clips randomly from speaker B, C, etc.
* Create batches with multiple speakers
* Train with contrastive/GE2E loss

The network learns: "extract features that are similar across these random clips from the same person."

At Inference (Voice Cloning)

You'd typically use just 1-3 short clips for cloning:
* Pass them through the encoder → get an embedding
* This embedding captures speaker characteristics
* Use it to condition your TTS

The encoder learned to extract consistent features during training, so even a single clip works reasonably well.

Then: Training the TTS Model

This is a separate stage (or joint, if you're ambitious):

Input: text/phonemes + speaker embedding
Output: mel-spectrogram (or compressed representation)
Training: standard TTS loss (reconstruction, alignment, etc.)

The TTS model learns: "given this speaker embedding and this text, generate the appropriate mel-spectrogram."

Practical Training Flow

    Stage 1 - Speaker Encoder:
    - Sample 5-10 clips per speaker per batch
    - Encode and pool → speaker embeddings
    - GE2E loss to separate speakers

    Stage 2 - TTS Model:
    - For each training sample (text + audio):
      - Get speaker embedding from 1-3 reference clips (using the trained encoder)
      - Train TTS: text + speaker_embedding → target mel-spectrogram

Why This Captures Style

Since you're allowing some entanglement (which I agree is good!), the encoder will naturally capture:
* Voice timbre
* Typical prosody patterns
* Speaking style
* Pace preferences

When you clone with a reference clip, you get all of these characteristics, which is what you want!

Does This Make Sense?

So to directly answer: no special categorization is needed. Just randomly sample multiple clips from each speaker during encoder training. The architecture forces learning speaker-consistent features because it has to produce similar embeddings for random clips from the same person.
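As a rough illustration of the pooling approach and the two-stage flow (a sketch under the assumptions above, PyTorch assumed; the encoder, decoder, and TTS loss are placeholders for whatever models you actually build):

```python
# Sketch of stage 1 (pooled multi-clip speaker embeddings) and stage 2
# (conditioning the TTS on a pooled embedding). PyTorch assumed; `encoder`,
# `decoder`, and `tts_loss` are placeholders for your own models.
import torch.nn.functional as F

def pooled_speaker_embedding(encoder, clips):
    """clips: (num_clips, T, n_mels) from ONE speaker → pooled (emb_dim,) embedding."""
    embs = encoder(clips)                        # (num_clips, emb_dim), one pass per clip
    return F.normalize(embs.mean(dim=0), dim=-1)

# Stage 1: for a batch of N speakers × several clip groups each, pool each group,
# then apply a metric-learning loss (e.g. the GE2E-style loss sketched earlier)
# across the pooled embeddings.
#
# Stage 2 (per training sample):
#   spk_emb = pooled_speaker_embedding(encoder, reference_clips)  # 1-3 random clips
#   mel_hat = decoder(phonemes, spk_emb)
#   loss = tts_loss(mel_hat, target_mel)   # reconstruction + alignment terms
```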
Is this clearer, or would you like me to elaborate on any specific part?

USER: Yes, that makes your multi-clip idea much clearer to me. You don't give the network multiple samples at once; you pass them separately and then average the resulting embeddings to get a single speaker embedding. I did think of that, but I wasn't sure how well it would work. Has it been tried before? I do like it though, because it means the decoder can be used to decode the output of the TTS... actually, now that I think about it, doesn't the decoder work like a TTS system? I understand the decoder transcript access thing is unrelated to the multi-clip idea, but it's still a good idea, isn't it? If the decoder requires a speaker embedding and some text/phonemes, doesn't that make it capable of text-to-speech?

CLAUDE: Yes! You've hit on something really important here. Let me address both points:

Multi-Clip Averaging

Yes, this has absolutely been tried and works well! It's used in several successful systems:
* Google's speaker verification work (where the GE2E loss comes from) uses exactly this
* Real-Time Voice Cloning (CorentinJ's popular implementation) averages embeddings from multiple clips
* Many speaker recognition systems use this approach

It works because, if the encoder has learned good speaker-discriminative features, averaging multiple embeddings from the same speaker:
* Reduces noise/variance from individual clips
* Strengthens the consistent speaker characteristics
* Gives you a more robust representation

At inference, even a single clip works reasonably well (because the encoder was trained to extract consistent features), but averaging 2-5 clips gives better quality.

The Decoder IS Basically a TTS!

You're absolutely right! This is a key insight. If you train an autoencoder where:
* Encoder: mel-spectrogram → speaker embedding
* Decoder: speaker embedding + text/phonemes → mel-spectrogram

then yes, the decoder is literally a TTS model! This is actually a very elegant architecture because:
1. Single unified system: the decoder you use for autoencoder training IS your TTS
2. Efficient training: you're learning both the speaker encoder and the TTS jointly
3. Guaranteed compatibility: the speaker embeddings are designed to work with your TTS decoder

How This Would Work in Practice

Training:

    Input: audio clip + transcript
    1. Extract mel-spectrogram from audio
    2. Encoder(mel) → speaker_embedding
    3. Decoder(speaker_embedding + phonemes) → reconstructed_mel
    4. Loss: reconstruction error between original and reconstructed mel

At inference (voice cloning):

    Input: reference audio clip(s) + new text
    1. Encoder(reference_mel) → speaker_embedding
       (or average multiple: mean([Encoder(clip1), Encoder(clip2), ...]))
    2. Decoder(speaker_embedding + new_phonemes) → output_mel
    3. Vocoder(output_mel) → audio waveform

This Architecture Has Been Explored!

Systems similar to this include:
* Style Tokens / GSTs (Global Style Tokens) in Tacotron
* VAE-based TTS where the VAE latent captures speaker/style
* Some multi-speaker Tacotron variants

The main architectural consideration is how the decoder combines the speaker embedding with the phoneme sequence:
* Concatenate the embedding to each encoder timestep
* Add it as conditioning to the attention mechanism
* Feed it into the decoder at each step

Advantages of This Approach

1. Simpler overall: one model instead of a separate speaker encoder + TTS
2. Direct optimization: the encoder learns to produce embeddings that work well for reconstruction
3. Natural disentanglement: the decoder has access to the phonemes, so the embedding can focus on speaker characteristics

One Consideration

You mentioned wanting to avoid the decoder having access to transcripts earlier, but I think that concern was about forcing clean separation. Given you're okay with some entanglement (style, prosody), this unified approach actually seems ideal for your use case!

The autoencoder framework with decoder-as-TTS is a solid, proven approach.
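A minimal sketch of that cloning path at inference time (PyTorch assumed; `encoder`, `decoder`, and `vocoder` stand in for whatever trained models you end up with, e.g. the vocoder could be a pre-trained HiFi-GAN):

```python
# Inference-time voice cloning sketch. `encoder`, `decoder`, and `vocoder`
# are placeholders for your trained models; shapes follow the earlier sketches.
import torch
import torch.nn.functional as F

@torch.no_grad()
def clone_voice(encoder, decoder, vocoder, reference_mels, new_phonemes):
    """reference_mels: list of (T_i, n_mels) tensors; new_phonemes: (T_text,) int tensor."""
    # Encode each reference clip separately, then average into one speaker embedding.
    embs = [encoder(mel.unsqueeze(0)) for mel in reference_mels]         # each (1, emb_dim)
    spk_emb = F.normalize(torch.stack(embs, dim=0).mean(dim=0), dim=-1)  # (1, emb_dim)
    # Decode the new text conditioned on the averaged embedding.
    output_mel = decoder(new_phonemes.unsqueeze(0), spk_emb)             # (1, T_out, n_mels)
    # Convert the mel-spectrogram back to a waveform.
    return vocoder(output_mel)
```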
Does this architecture align with what you're envisioning?

USER: My only concern with that approach would be that the embeddings may still capture something about the content of the sample, so using an average of embeddings when using the decoder for TTS might result in weird audio glitches.

CLAUDE: That's a very valid concern! You're right to think about this carefully.

The Content Leakage Problem

If embeddings capture content information, then averaging embeddings from clips saying different things could give you:
* A "nonsense" averaged content representation
* Potential glitches or artifacts when the decoder tries to interpret contradictory signals
* The decoder receiving both "phonemes say X" and "embedding suggests content Y"

Why This Might Not Be As Bad As It Seems

1. The decoder has a strong content signal
The phoneme/text input is typically a much stronger, more explicit signal for content than any implicit encoding in the speaker embedding. The decoder would likely learn to:
* Trust the phoneme input for "what to say"
* Use the embedding for "how to say it"

2. Averaging naturally cancels content
If you average embeddings from clips with different content, any content-specific information would be inconsistent and average out to noise, while speaker characteristics remain consistent.

3. Empirical evidence
Systems that use this approach (like speaker verification embeddings for TTS) generally work fine, suggesting content leakage isn't catastrophic in practice.

But You Could Be More Rigorous

If you want to minimize this risk, here are some approaches:

Option 1: Multi-clip encoder architecture
Instead of averaging after encoding, feed multiple clips into a network that processes them together:

    Input: multiple mel-spectrograms
    Architecture:
    - Shared encoder for each clip
    - Cross-attention or set-based aggregation (e.g., DeepSets)
    - Output: single speaker embedding

This architecture can learn to explicitly ignore varying content and focus on consistent features.

Option 2: Adversarial content removal
Add an auxiliary loss during training:
* From the speaker embedding, try to predict the phoneme content
* Add a gradient reversal layer so the encoder learns to make this prediction harder
* This forces the embedding to not contain content information

Option 3: Augmentation during training
When training the autoencoder:
* Always use averaged embeddings (from 2-5 random clips of the same speaker)
* Never use single-clip embeddings
* This forces the system to work with averaged embeddings from the start

Option 4: VAE with a strong prior
Use a VAE framework where the KL divergence loss strongly regularizes the latent space, making it harder to encode content-specific details.

My Recommendation

I'd suggest starting with Option 3 (training with averaged embeddings) because:
* It's simple to implement
* It forces the model to handle exactly the inference scenario
* It naturally encourages content-invariant embeddings
* No complex architectural changes are needed

You could combine it with the multi-clip encoder (Option 1) for even better results.

Testing Your Concern

You could empirically test for content leakage:
* Take embeddings from the same speaker saying very different things
* Average them
* Generate speech
* Check if there are glitches or if content from the reference clips "bleeds through"

If problems emerge, then add the more sophisticated disentanglement techniques.
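If you do go the adversarial route (Option 2), the gradient reversal layer itself is small. A minimal sketch, assuming PyTorch; the content classifier and the content labels it predicts are hypothetical placeholders:

```python
# Minimal gradient-reversal sketch for Option 2 (PyTorch assumed). The idea: a
# classifier tries to predict content from the speaker embedding, and the
# reversed gradient pushes the encoder to make that prediction harder.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale=1.0):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient's sign on the way back into the encoder.
        return -ctx.scale * grad_output, None

def adversarial_content_loss(spk_emb, content_targets, content_classifier, scale=1.0):
    """spk_emb: (B, emb_dim); content_targets: (B,) class ids for a hypothetical
    content label (e.g. a dominant-phoneme or bag-of-phonemes proxy)."""
    reversed_emb = GradReverse.apply(spk_emb, scale)
    logits = content_classifier(reversed_emb)        # (B, num_classes)
    return torch.nn.functional.cross_entropy(logits, content_targets)

# total_loss = reconstruction_loss + adversarial_content_loss(spk_emb, targets, clf)
```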
Does this address your concern? The training-with-averaging approach seems like a practical starting point.

USER: I agree with your recommendation, but doesn't that mean we need to do multiple forward passes in order to get an average embedding we can use with backprop? Could that potentially cause any problems when the network initially starts training?

CLAUDE: Excellent catch! Yes, you're right that this requires multiple forward passes, and yes, this could potentially cause issues early in training.

The Multiple Forward Pass Issue

To get an averaged embedding for training:

    # Sample 3-5 clips from the same speaker
    embed1 = encoder(clip1)
    embed2 = encoder(clip2)
    embed3 = encoder(clip3)
    avg_embed = (embed1 + embed2 + embed3) / 3

    # Then use avg_embed in the decoder
    output = decoder(avg_embed, phonemes)
    loss = reconstruction_loss(output, target)
    loss.backward()

This is totally fine for backprop! The gradients flow back through the averaging operation to each individual encoder forward pass. No technical problem there.

Potential Early Training Issues

The real concern is: early in training, when the encoder produces random/noisy embeddings, averaging them might give you an even more "confused" signal. The decoder receives an average of garbage and has to somehow learn from that.

This could potentially:
* Slow down initial learning
* Make training less stable at the start
* Give weaker gradients to the encoder initially

Solutions

Option 1: Curriculum learning (gradual averaging)
Start simple, increase difficulty:

    Epochs 1-10:  use a single clip (no averaging)
    Epochs 11-20: average 2 clips
    Epochs 21+:   average 3-5 clips

This lets the network learn basic speaker encoding first, then adapt to averaged embeddings.

Option 2: Mixed training
Randomly choose during training:
* 50% of batches: single clip
* 50% of batches: averaged clips

This gives the network experience with both scenarios.

Option 3: Pre-train the encoder
* First train just the speaker encoder with metric learning (GE2E/contrastive)
* This gives you reasonable embeddings from the start
* Then train the full autoencoder system with averaging
* The encoder starts from a better initialization

Option 4: Just go for it
Modern optimizers (Adam, AdamW) are pretty robust. The network might just figure it out. The decoder will learn that averaged embeddings are the signal it gets, and the encoder will learn to produce consistent embeddings because that's what minimizes the reconstruction loss.

My Take

I'd probably try Option 4 first (just train with averaging from the start) because:
* It's simpler - no curriculum scheduling
* The loss signal is clear: produce consistent embeddings or reconstruction fails
* If it doesn't work well, fall back to Option 1 (curriculum) or Option 3 (pre-training)

But Option 3 (pre-train the encoder with metric learning first) is the most principled and would likely give you the smoothest training overall.
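And if you do end up falling back to the curriculum idea in Option 1, the scheduling piece is tiny. A sketch, where the epoch boundaries, clip counts, and the `speaker_clips`/`mel()` helpers in the usage comment are all placeholder assumptions to adapt:

```python
# Curriculum for how many reference clips to average per sample (Option 1 sketch).
# Epoch boundaries and clip counts are arbitrary placeholders to tune.
import random

def clips_per_sample(epoch: int) -> int:
    if epoch < 10:
        return 1                      # single clip: no averaging yet
    if epoch < 20:
        return 2                      # gentle averaging
    return random.randint(3, 5)       # full multi-clip averaging

# In the training loop (speaker_clips and mel() are hypothetical helpers):
#   k = clips_per_sample(epoch)
#   refs = random.sample(speaker_clips[speaker_id], k)
#   spk_emb = torch.stack([encoder(mel(c)) for c in refs]).mean(dim=0)
```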