It’s almost a year since GPT-4o came out - and we haven’t had much that is end-to-end and multi-modal from open source since. The main open source example is Moshi - an incredible paper btw - although its performance fell well short of GPT-4o.
Now, we’re on the cusp of Llama 4 - which Zuckerberg says will be end-to-end multi-modal - and we are seeing token-based audio models already, from Sesame with CSM-1B and Canopy Labs with Orpheus.
In this video, I explain how end-to-end multi-modal models (i.e. text + speech) can be built using a “token-based” approach. Basically, you convert everything (audio included, using hierarchical tokenisation) into tokens.
Then you just use transformers!!! This makes the models quite a bit easier to handle than complicated diffusion-based approaches (like StyleTTS2).
I cover Moshi, CSM-1B (a text-to-speech model) and Orpheus (also text-to-speech). I describe not just how to use the CSM-1B and Orpheus models, but also how to do voice cloning AND fine-tuning on Orpheus, AND a combo of voice cloning + fine-tuning on Orpheus.
And, all of the scripts are available as part of the ADVANCED-transcription repo.
Cheers, Ronan
PS: I've rotated all hf access keys
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Video Links:
* Slides
* One-click Runpod template (affiliate)
* Llama 3 Paper
* StyleTTS2
* Moshi
* Orpheus
* Sesame’s CSM-1B
* Colab Notebook - Orpheus Cloning
* Colab Notebook - Orpheus Inference
TIMESTAMPS:
00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4 (?)
01:04 End-to-End Multimodal Models and Their Capabilities
02:36 Traditional Approaches to Text-to-Speech
03:06 Token-Based Approaches and Their Advantages
03:25 Detailed Look at Orpheus and CSM-1B Models
06:58 Training and Inference with Token-Based Models
12:53 Hierarchical Tokenization for High-Quality Audio
14:11 Kyutai’s Moshi Model for Text + Speech
23:41 Sesame’s CSM-1B Model Architecture
25:13 Orpheus TTS architecture by Canopy Labs
27:34 Inferencing and Cloning with CSM-1B
40:13 Context Aware Text to Speech with CSM-1B
48:21 Orpheus Inference and Cloning - FREE Colab
55:09 Orpheus Voice Cloning Setup
01:01:20 Orpheus Fine-tuning (Full fine-tuning and LoRA fine-tuning)
01:09:55 Running Full Fine Tuning
01:19:33 Running LoRA Fine Tuning
01:25:20 Inference and Comparison
01:29:27 Inference with Cloning AND fine-tuning
01:35:48 The future of token-based multi-modal models
Token-Based Multimodal Models for Text-to-Speech
This article covers recent advances in token-based multimodal models for text-to-speech synthesis, focusing on three key models: CSM-1B, Orpheus, and Moshi.
Core Technical Approach
Token-based models represent both text and audio using discrete tokens, enabling a unified transformer architecture to process multiple modalities. Key aspects (a toy sketch of the quantization step follows this list):
* Audio is quantized into discrete tokens using learned codebooks
* Multiple hierarchical layers (8-32) encode different audio attributes
* Single transformer processes text and audio tokens together
* Decoder converts output tokens back to audio waveforms
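To make “quantized into discrete tokens using learned codebooks” concrete, here is a toy NumPy sketch of residual vector quantization, the mechanism behind this kind of hierarchical tokenization: each layer quantizes what the previous layers left unexplained, yielding one token per codebook per frame. The codebooks here are random stand-ins rather than learned ones, and the frame embedding is synthetic.

```python
# Toy residual vector quantization (RVQ): hierarchical tokenization of audio.
# NOTE: random codebooks and a synthetic frame embedding - real codecs learn
# the encoder and codebooks end-to-end on audio.
import numpy as np

rng = np.random.default_rng(0)
n_layers, codebook_size, dim = 8, 1024, 64   # e.g. 8 codebook layers (Moshi-style)
codebooks = rng.normal(size=(n_layers, codebook_size, dim))

def rvq_encode(frame_embedding):
    """Quantize one audio-frame embedding into n_layers discrete tokens."""
    residual = frame_embedding
    tokens = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]        # next layer encodes what is left over
    return tokens

def rvq_decode(tokens):
    """Sum the chosen codebook vectors to approximately reconstruct the frame."""
    return sum(codebooks[layer][idx] for layer, idx in enumerate(tokens))

frame = rng.normal(size=dim)                 # stand-in for one encoder frame
codes = rvq_encode(frame)                    # 8 integers, one per hierarchical layer
print(codes, float(np.linalg.norm(frame - rvq_decode(codes))))
```

With random codebooks the reconstruction is poor; the point is the structure - earlier (coarser) layers carry most of the signal and later layers add detail, which is what lets these models trade quality against speed.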
Model Architectures
CSM-1B
* Uses Llama 1B backbone
* 32 hierarchical codebook layers
* Two-stage decoding:
    * Main transformer predicts the first codebook token for each frame
    * Smaller decoder generates the remaining 31 tokens
* Optimized for real-time generation, with audio frames produced at 12.5 Hz
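A minimal sketch of that two-stage decoding loop, assuming stand-in functions for the Llama-1B backbone, the small per-frame decoder, and the audio codec (these are placeholders, not the real CSM-1B API):

```python
# Hedged sketch of CSM-1B-style two-stage decoding. `backbone` and `decoder`
# are stand-ins for the real networks; they just return dummy tokens here.
from typing import List, Tuple

N_CODEBOOKS = 32            # 32 hierarchical codebook layers per audio frame
FRAME_RATE_HZ = 12.5        # one 32-token frame every 80 ms

def backbone(context: List[int]) -> Tuple[int, object]:
    """Stand-in for the main transformer: returns (first codebook token, hidden state)."""
    return 0, None

def decoder(hidden_state: object, first_token: int) -> List[int]:
    """Stand-in for the smaller decoder that fills in codebooks 1..31 for the frame."""
    return [0] * (N_CODEBOOKS - 1)

def generate_frames(text_tokens: List[int], n_frames: int) -> List[List[int]]:
    context = list(text_tokens)            # text and audio tokens share one sequence
    frames = []
    for _ in range(n_frames):
        tok0, h = backbone(context)        # stage 1: predict the first (coarse) token
        frame = [tok0] + decoder(h, tok0)  # stage 2: generate the remaining 31
        frames.append(frame)
        context.extend(frame)              # autoregressive: feed the frame back in
    return frames                          # a codec turns these frames into audio

frames = generate_frames(text_tokens=[1, 2, 3], n_frames=5)   # 5 frames ≈ 0.4 s
print(len(frames), len(frames[0]))         # -> 5 32
```

The split keeps the expensive backbone call to once per frame (12.5 times per second) while the cheap decoder handles the other 31 tokens, which is roughly how the design keeps real-time generation feasible.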
Orpheus
* Built on Llama 3B
* Single codebook shared across hierarchical layers
* Single transformer generates all tokens
* Uses convolutional layers for hierarchical encoding
* Fine-tunable for voice cloning
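A hedged sketch of Orpheus-style inference using Hugging Face transformers. The checkpoint name, the speaker-prefix prompt format, and the token-to-codec unpacking are assumptions/placeholders here - the exact details are in the Orpheus repo and the Colab notebooks linked above.

```python
# Hedged Orpheus-style TTS sketch: a Llama-3B causal LM emits audio tokens,
# which get regrouped into codec codebooks and decoded to a waveform.
# ASSUMPTIONS: the model ID, prompt format, and unpack_to_codec_codes() are
# placeholders; consult the Orpheus repo for the real token layout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "canopylabs/orpheus-3b-0.1-ft"        # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

prompt = "tara: Hey there, this is a token-based TTS demo."   # assumed speaker-prefix format
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.6)

audio_token_ids = out[0, inputs["input_ids"].shape[1]:]       # keep only generated tokens

def unpack_to_codec_codes(ids):
    """Placeholder: map flat LM token ids back to the codec's hierarchical
    codebook levels (the offsets and per-frame grouping are model-specific)."""
    raise NotImplementedError

# codes = unpack_to_codec_codes(audio_token_ids)
# waveform = audio_codec.decode(codes)   # e.g. a SNAC-style neural codec at 24 kHz
```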
Moshi
* Custom 7B transformer backbone
* 8 hierarchical codebook layers
* Two-stage architecture:
    * Main transformer predicts embeddings
    * Decoder converts to 8 tokens per timestep
* Supports real-time conversation
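To illustrate the real-time conversation side, here is a toy sketch of how a Moshi-style model can lay out parallel token streams per timestep - one text token plus an 8-token audio frame for each speaker. The exact stream count and ordering are assumptions for illustration, not a definitive description of Moshi.

```python
# Toy sketch of a Moshi-style multi-stream timestep: at each ~80 ms step the
# model attends over one text token plus one 8-codebook audio frame per speaker.
# ASSUMPTION: the concrete ordering and count of streams is illustrative only.
N_CODEBOOKS = 8

def build_step(text_token, model_audio_frame, user_audio_frame):
    """Assemble the parallel streams the temporal transformer sees at one step."""
    assert len(model_audio_frame) == N_CODEBOOKS
    assert len(user_audio_frame) == N_CODEBOOKS
    return [text_token, *model_audio_frame, *user_audio_frame]

step = build_step(
    text_token=42,
    model_audio_frame=[101, 205, 33, 7, 564, 912, 88, 14],   # dummy codec tokens
    user_audio_frame=[400, 21, 77, 640, 3, 515, 230, 9],     # dummy codec tokens
)
print(len(step))   # -> 17 parallel tokens for this timestep
```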
Performance Considerations
The models face key technical challenges (a quick back-of-envelope calculation follows this list):
* Need to generate 100+ tokens/second for real-time speech
* Memory constraints limit model size for real-time use
* Trade-off between audio quality and generation speed
* Hierarchical approaches help balance quality vs speed
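A quick back-of-envelope on the 100+ tokens/second figure, using the 12.5 Hz frame rate and the codebook counts quoted above:

```python
# Token throughput needed for real-time speech = frame rate x codebook count.
frame_rate_hz = 12.5                            # audio frames per second

moshi_rate = frame_rate_hz * 8                  # 8 codebooks  -> 100 tokens/s
csm_rate = frame_rate_hz * 32                   # 32 codebooks -> 400 tokens/s
print(moshi_rate, csm_rate)

# Real-time playback means the backbone + per-frame decoder + codec together
# must sustain at least this rate, hence the pressure on model size and memory.
```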
Practical Applications
The models enable:
* Text-to-speech synthesis
* Voice cloning with few samples (sketched after this list)
* Multi-turn conversations
* Emotion preservation
* Speaker style transfer
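Finally, a stand-in sketch of what “voice cloning with few samples” looks like in practice with these models: condition on a short reference clip (its transcript plus its codec tokens), then generate audio tokens for new text in that same voice. Every helper below is a placeholder - the working cloning scripts are in the ADVANCED-transcription repo and the Colab notebooks linked above.

```python
# Hedged sketch of prompt-based voice cloning: the reference clip's transcript
# and audio tokens go in front of the new text, and the model continues in the
# same voice. All functions here are stand-ins, not a real API.
from typing import List

def encode_text(text: str) -> List[int]:
    """Stand-in for the model's text tokenizer."""
    return [ord(c) % 256 for c in text]

def encode_reference_audio(path: str) -> List[int]:
    """Stand-in for codec-encoding a short reference clip into audio tokens."""
    return [0] * 100                      # dummy tokens standing in for the clip

def generate_audio_tokens(prompt_tokens: List[int]) -> List[int]:
    """Stand-in for autoregressive generation of new audio tokens."""
    return [0] * 250

reference_text = "This is a short clip of the voice I want to clone."
target_text = "And this is a brand new sentence, spoken in that same voice."

prompt = (
    encode_text(reference_text)
    + encode_reference_audio("reference.wav")   # the voice identity lives in these tokens
    + encode_text(target_text)
)
new_audio_tokens = generate_audio_tokens(prompt)   # decode with the codec to get audio
print(len(new_audio_tokens))
```

Fine-tuning on more samples of the target voice (as covered for Orpheus in the video) pushes the same idea further: instead of only prompting, the model's weights themselves adapt to the speaker.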
The technology represents a shift toward unified multimodal architectures, though real-time performance remains an active area of development (especially in open source).
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit trelis.substack.com