It’s almost a year since GPT-4o came out - and we haven’t had much that is end-to-end and multi-modal from open source since. The main open source example is Moshi - an incredible paper btw - although its performance fell well short of GPT-4o.
Now, we’re on the cusp of Llama 4 - which Zuckerberg says will be end-to-end multi-modal - and we are seeing token-based audio models already, from Sesame with CSM-1B and Canopy Labs with Orpheus.
In this video, I explain how end-to-end multi-modal models (i.e. text + speech) can be built using a “token-based” approach. Basically, you convert everything (audio included, using hierarchical tokenisation) into tokens.
Then you just use transformers!!! This makes the models quite a bit easier to handle than complicated diffusion-based approaches (like StyleTTS2).
I cover Moshi, CSM-1B (a text-to-speech model) and Orpheus (also text-to-speech). I describe not just how to use the CSM-1B and Orpheus models, but also how to do voice cloning AND fine-tuning on Orpheus, AND a combo of voice cloning + fine-tuning on Orpheus.
And, all of the scripts are available as part of the ADVANCED-transcription repo.
Cheers, Ronan
PS: I've rotated all hf access keys
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Video Links:
* Slides
* One-click Runpod template (affiliate)
* Llama 3 Paper
* StyleTTS2
* Moshi
* Orpheus
* Sesame’s CSM-1B
* Colab Notebook - Orpheus Cloning
* Colab Notebook - Orpheus Inference
TIMESTAMPS:
00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4 (?)
01:04 End-to-End Multimodal Models and Their Capabilities
02:36 Traditional Approaches to Text-to-Speech
03:06 Token-Based Approaches and Their Advantages
03:25 Detailed Look at Orpheus and CSM-1B Models
06:58 Training and Inference with Token-Based Models
12:53 Hierarchical Tokenization for High-Quality Audio
14:11 Kyutai’s Moshi Model for Text + Speech
23:41 Sesame’s CSM-1B Model Architecture
25:13 Orpheus TTS architecture by Canopy Labs
27:34 Inferencing and Cloning with CSM-1B
40:13 Context Aware Text to Speech with CSM-1B
48:21 Orpheus Inference and Cloning - FREE Colab
55:09 Orpheus Voice Cloning Setup
01:01:20 Orpheus Fine-tuning (Full fine-tuning and LoRA fine-tuning)
01:09:55 Running Full Fine Tuning
01:19:33 Running LoRA Fine Tuning
01:25:20 Inference and Comparison
01:29:27 Inference with Cloning AND fine-tuning
01:35:48 The future of token-based multi-modal models
Token-Based Multimodal Models for Text-to-Speech
This article covers recent advances in token-based multimodal models for text-to-speech synthesis, focusing on three key models: CSM-1B, Orpheus, and Moshi.
Core Technical Approach
Token-based models represent both text and audio using discrete tokens, enabling a unified transformer architecture to process multiple modalities. Key aspects (a toy sketch of the quantization step follows this list):
* Audio is quantized into discrete tokens using learned codebooks
* Multiple hierarchical layers (8-32) encode different audio attributes
* Single transformer processes text and audio tokens together
* Decoder converts output tokens back to audio waveforms
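To make “quantized into discrete tokens using learned codebooks” concrete, here is a toy NumPy sketch of residual vector quantization, the mechanism behind this kind of hierarchical tokenization: each layer quantizes what the previous layers left unexplained, yielding one token per codebook per frame. The codebooks here are random stand-ins rather than learned ones, and the frame embedding is synthetic.

```python
# Toy residual vector quantization (RVQ): hierarchical tokenization of audio.
# NOTE: random codebooks and a synthetic frame embedding - real codecs learn
# the encoder and codebooks end-to-end on audio.
import numpy as np

rng = np.random.default_rng(0)
n_layers, codebook_size, dim = 8, 1024, 64   # e.g. 8 codebook layers (Moshi-style)
codebooks = rng.normal(size=(n_layers, codebook_size, dim))

def rvq_encode(frame_embedding):
    """Quantize one audio-frame embedding into n_layers discrete tokens."""
    residual = frame_embedding
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]        # next layer encodes what is left over
    return tokens

def rvq_decode(tokens):
    """Sum the chosen codebook vectors to approximately reconstruct the frame."""
    return sum(codebooks[layer][idx] for layer, idx in enumerate(tokens))

frame = rng.normal(size=dim)                 # stand-in for one encoder frame
codes = rvq_encode(frame)                    # 8 integers, one per hierarchical layer
print(codes, float(np.linalg.norm(frame - rvq_decode(codes))))
```

With random codebooks the reconstruction is poor; the point is the structure - earlier (coarser) layers carry most of the signal and later layers add detail, which is what lets these models trade quality against speed.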
Model Architectures
CSM-1B
* Uses Llama 1B backbone
* 32 hierarchical codebook layers
* Two-stage decoding:
    * Main transformer predicts the first codebook token for each frame
    * Smaller decoder generates the remaining 31 tokens
* Optimized for real-time generation, with audio frames produced at 12.5 Hz
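A minimal sketch of that two-stage decoding loop, assuming stand-in functions for the Llama-1B backbone, the small per-frame decoder, and the audio codec (these are placeholders, not the real CSM-1B API):

```python
# Hedged sketch of CSM-1B-style two-stage decoding. `backbone` and `decoder`
# are stand-ins for the real networks; they just return dummy tokens here.
from typing import List, Tuple

N_CODEBOOKS = 32            # 32 hierarchical codebook layers per audio frame
FRAME_RATE_HZ = 12.5        # one 32-token frame every 80 ms

def backbone(context: List[int]) -> Tuple[int, object]:
    """Stand-in for the main transformer: returns (first codebook token, hidden state)."""
    return 0, None

def decoder(hidden_state: object, first_token: int) -> List[int]:
    """Stand-in for the smaller decoder that fills in codebooks 1..31 for the frame."""
    return [0] * (N_CODEBOOKS - 1)

def generate_frames(text_tokens: List[int], n_frames: int) -> List[List[int]]:
    context = list(text_tokens)            # text and audio tokens share one sequence
    frames = []
    for _ in range(n_frames):
        tok0, h = backbone(context)        # stage 1: predict the first (coarse) token
        frame = [tok0] + decoder(h, tok0)  # stage 2: generate the remaining 31
        frames.append(frame)
        context.extend(frame)              # autoregressive: feed the frame back in
    return frames                          # a codec turns these frames into audio

frames = generate_frames(text_tokens=[1, 2, 3], n_frames=5)   # 5 frames ≈ 0.4 s
print(len(frames), len(frames[0]))         # -> 5 32
```

The split keeps the expensive backbone call to once per frame (12.5 times per second) while the cheap decoder handles the other 31 tokens, which is roughly how the design keeps real-time generation feasible.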
Orpheus
* Built on Llama 3B
* Single codebook shared across hierarchical layers
* Single transformer generates all tokens
* Uses convolutional layers for hierarchical encoding
* Fine-tunable for voice cloning
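A hedged sketch of Orpheus-style inference using Hugging Face transformers. The checkpoint name, the speaker-prefix prompt format, and the token-to-codec unpacking are assumptions/placeholders here - the exact details are in the Orpheus repo and the Colab notebooks linked above.

```python
# Hedged Orpheus-style TTS sketch: a Llama-3B causal LM emits audio tokens,
# which get regrouped into codec codebooks and decoded to a waveform.
# ASSUMPTIONS: the model ID, prompt format, and unpack_to_codec_codes() are
# placeholders; consult the Orpheus repo for the real token layout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "canopylabs/orpheus-3b-0.1-ft"        # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

prompt = "tara: Hey there, this is a token-based TTS demo."   # assumed speaker-prefix format
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)

audio_token_ids = out[0, inputs["input_ids"].shape[1]:]       # keep only generated tokens

def unpack_to_codec_codes(ids):
    """Placeholder: map flat LM token ids back to the codec's hierarchical
    codebook levels (the offsets and per-frame grouping are model-specific)."""
    raise NotImplementedError

# codes = unpack_to_codec_codes(audio_token_ids)
# waveform = audio_codec.decode(codes)   # e.g. a SNAC-style neural codec at 24 kHz
```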
Moshi
* Custom 7B transformer backbone
* 8 hierarchical codebook layers
* Two-stage architecture:
    * Main transformer predicts embeddings
    * Decoder converts to 8 tokens per timestep
* Supports real-time conversation
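To illustrate the real-time conversation side, here is a toy sketch of how a Moshi-style model can lay out parallel token streams per timestep - one text token plus an 8-token audio frame for each speaker. The exact stream count and ordering are assumptions for illustration, not a definitive description of Moshi.

```python
# Toy sketch of a Moshi-style multi-stream timestep: at each ~80 ms step the
# model attends over one text token plus one 8-codebook audio frame per speaker.
# ASSUMPTION: the concrete ordering and count of streams is illustrative only.
N_CODEBOOKS = 8

def build_step(text_token, model_audio_frame, user_audio_frame):
    """Assemble the parallel streams the temporal transformer sees at one step."""
    assert len(model_audio_frame) == N_CODEBOOKS
    assert len(user_audio_frame) == N_CODEBOOKS
    return [text_token, *model_audio_frame, *user_audio_frame]

step = build_step(
    text_token=42,
    model_audio_frame=[101, 205, 33, 7, 564, 912, 88, 14],   # dummy codec tokens
    user_audio_frame=[400, 21, 77, 640, 3, 515, 230, 9],     # dummy codec tokens
)
print(len(step))   # -> 17 parallel tokens for this timestep
```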
Performance Considerations
The models face key technical challenges (a quick back-of-envelope calculation follows this list):
* Need to generate 100+ tokens/second for real-time speech
* Memory constraints limit model size for real-time use
* Trade-off between audio quality and generation speed
* Hierarchical approaches help balance quality vs speed
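A quick back-of-envelope on the 100+ tokens/second figure, using the 12.5 Hz frame rate and the codebook counts quoted above:

```python
# Token throughput needed for real-time speech = frame rate x codebook count.
frame_rate_hz = 12.5                            # audio frames per second

moshi_rate = frame_rate_hz * 8                  # 8 codebooks  -> 100 tokens/s
csm_rate = frame_rate_hz * 32                   # 32 codebooks -> 400 tokens/s
print(moshi_rate, csm_rate)

# Real-time playback means the backbone + per-frame decoder + codec together
# must sustain at least this rate, hence the pressure on model size and memory.
```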
Practical Applications
The models enable:
* Text-to-speech synthesis
* Voice cloning with few samples (sketched after this list)
* Multi-turn conversations
* Emotion preservation
* Speaker style transfer
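Finally, a stand-in sketch of what “voice cloning with few samples” looks like in practice with these models: condition on a short reference clip (its transcript plus its codec tokens), then generate audio tokens for new text in that same voice. Every helper below is a placeholder - the working cloning scripts are in the ADVANCED-transcription repo and the Colab notebooks linked above.

```python
# Hedged sketch of prompt-based voice cloning: the reference clip's transcript
# and audio tokens go in front of the new text, and the model continues in the
# same voice. All functions here are stand-ins, not a real API.
from typing import List

def encode_text(text: str) -> List[int]:
    """Stand-in for the model's text tokenizer."""
    return [ord(c) % 256 for c in text]

def encode_reference_audio(path: str) -> List[int]:
    """Stand-in for codec-encoding a short reference clip into audio tokens."""
    return [0] * 100                      # dummy tokens standing in for the clip

def generate_audio_tokens(prompt_tokens: List[int]) -> List[int]:
    """Stand-in for autoregressive generation of new audio tokens."""
    return [0] * 250

reference_text = "This is a short clip of the voice I want to clone."
target_text = "And this is a brand new sentence, spoken in that same voice."

prompt = (
    encode_text(reference_text)
    + encode_reference_audio("reference.wav")   # the voice identity lives in these tokens
    + encode_text(target_text)
)
new_audio_tokens = generate_audio_tokens(prompt)   # decode with the codec to get audio
print(len(new_audio_tokens))
```

Fine-tuning on more samples of the target voice (as covered for Orpheus in the video) pushes the same idea further: instead of only prompting, the model's weights themselves adapt to the speaker.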
The technology represents a shift toward unified multimodal architectures, though real-time performance remains an active area of development (especially in open source).
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit trelis.substack.com