Trelis Research

Inference, Cloning and Fine-tuning: Orpheus, CSM 1B and Moshi

It’s almost a year since GPT-4o came out - and we haven’t had much that is end-to-end and multi-modal from open source since. The main open source example is Moshi - an incredible paper btw - although its performance fell quite short of GPT-4o’s.

Now, we’re on the cusp of Llama 4 - which Zuckerberg says will be end-to-end multi-modal - and we are seeing token-based audio models already, from Sesame with CSM-1B and Canopy Labs with Orpheus.

In this video, I explain how end-to-end multi-modal models (i.e. text + speech) can be built using a “token-based” approach. Basically, you convert everything (audio included, using hierarchical tokenisation) into tokens.

Then you just use transformers!!! This makes the models quite a bit easier to handle than complicated diffusion-based approaches (like StyleTTS2).

I cover Moshi, CSM-1B (a text-to-speech model) and Orpheus (also text-to-speech). I describe not just how to run inference with CSM-1B and Orpheus, but also how to do voice cloning, fine-tuning, AND voice cloning combined with fine-tuning on Orpheus.

And, all of the scripts are available as part of the ADVANCED-transcription repo.

Cheers, Ronan

PS: I've rotated all hf access keys

Trelis Links:

🤝 Are you a talented developer? Work for Trelis

💡 Need Technical or Market Assistance? Book a Consult Here

💸 Starting a New Project/Venture? Apply for a Trelis Grant

Video Links:

* Slides

* One-click Runpod template (affiliate)

* Llama 3 Paper

* StyleTTS2

* Moshi

* Orpheus

* Sesame’s CSM-1B

* Colab Notebook - Orpheus Cloning

* Colab Notebook - Orpheus Inference

TIMESTAMPS:

00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4 (?)

01:04 End-to-End Multimodal Models and Their Capabilities

02:36 Traditional Approaches to Text-to-Speech

03:06 Token-Based Approaches and Their Advantages

03:25 Detailed Look at Orpheus and CSM-1B Models

06:58 Training and Inference with Token-Based Models

12:53 Hierarchical Tokenization for High-Quality Audio

14:11 Kyutai’s Moshi Model for Text + Speech

23:41 Sesame’s CSM-1B Model Architecture

25:13 Orpheus TTS architecture by Canopy Labs

27:34 Inferencing and Cloning with CSM-1B

40:13 Context Aware Text to Speech with CSM-1B

48:21 Orpheus Inference and Cloning - FREE Colab

55:09 Orpheus Voice Cloning Setup

01:01:20 Orpheus Fine-tuning (Full fine-tuning and LoRA fine-tuning)

01:09:55 Running Full Fine Tuning

01:19:33 Running LoRA Fine Tuning

01:25:20 Inference and Comparison

01:29:27 Inference with Cloning AND fine-tuning

01:35:48 The future of token-based multi-modal models

Token-Based Multimodal Models for Text-to-Speech

This article covers recent advances in token-based multimodal models for text-to-speech synthesis, focusing on three key models: CSM-1B, Orpheus, and Moshi.

Core Technical Approach

Token-based models represent both text and audio using discrete tokens, enabling a unified transformer architecture to process multiple modalities (a toy sketch follows the list below). Key aspects:

* Audio is quantized into discrete tokens using learned codebooks

* Multiple hierarchical layers (8-32) encode different audio attributes

* Single transformer processes text and audio tokens together

* Decoder converts output tokens back to audio waveforms
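
To make those bullets concrete, here is a toy sketch (not any specific model’s format) of putting text tokens and quantized audio tokens into one flat sequence that a decoder-only transformer can model with plain next-token prediction. All vocabulary sizes and token IDs below are made-up placeholders.

```python
# Toy sketch: text + audio in one token stream (all IDs/sizes are placeholders).
import random

TEXT_VOCAB = 32_000        # hypothetical text vocabulary size
AUDIO_CODEBOOK = 1_024     # hypothetical audio codec codebook size
BOS_AUDIO = TEXT_VOCAB     # special token marking where audio begins

def audio_token(code: int) -> int:
    """Shift an audio code into an ID range that doesn't clash with text IDs."""
    return BOS_AUDIO + 1 + code

random.seed(0)
text_tokens = [101, 2057, 8934, 17]                                  # pretend-tokenised prompt
audio_codes = [random.randrange(AUDIO_CODEBOOK) for _ in range(25)]  # ~2 s at 12.5 Hz

# One flat sequence: text prompt, then audio. A decoder-only transformer trained
# on sequences like this learns text-to-speech as ordinary next-token prediction.
sequence = text_tokens + [BOS_AUDIO] + [audio_token(c) for c in audio_codes]
print(len(sequence), sequence[:8])
```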

Model Architectures

CSM-1B

* Uses Llama 1B backbone

* 32 hierarchical codebook layers

* Two-stage decoding (sketched after this list):

  * Main transformer predicts the first token of each frame

  * Smaller decoder generates the remaining 31 tokens

* Optimized for real-time generation with audio tokens produced at 12.5Hz
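
Here is a hedged sketch of that two-stage pattern: a backbone transformer runs once per 12.5 Hz frame and predicts the first codebook token, then a much smaller decoder fills in the other 31 codebooks for that frame. Module choices, layer counts, and widths below are placeholders, not CSM-1B’s actual dimensions.

```python
# Hedged sketch of two-stage decoding (placeholder sizes, not CSM-1B's real ones).
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32                  # 32 hierarchical codebooks per audio frame
CODEBOOK_SIZE = 1024                # placeholder codebook size
D_MAIN, D_SMALL = 512, 256          # placeholder widths (the real model is larger)

main = nn.TransformerEncoder(       # stands in for the Llama-1B-style backbone
    nn.TransformerEncoderLayer(D_MAIN, nhead=8, batch_first=True), num_layers=2)
head_first = nn.Linear(D_MAIN, CODEBOOK_SIZE)     # predicts codebook 0 of a frame

small = nn.TransformerEncoder(      # small decoder: 31 cheap steps per frame
    nn.TransformerEncoderLayer(D_SMALL, nhead=4, batch_first=True), num_layers=2)
proj = nn.Linear(D_MAIN, D_SMALL)
head_rest = nn.Linear(D_SMALL, CODEBOOK_SIZE)
code_emb = nn.Embedding(CODEBOOK_SIZE, D_SMALL)

def generate_frame(context: torch.Tensor) -> list[int]:
    """Generate all 32 codebook tokens for one audio frame (one 80 ms step)."""
    h = main(context)[:, -1]                       # one backbone pass per frame
    codes = [head_first(h).argmax(-1)]             # codebook 0 from the backbone
    state = proj(h).unsqueeze(1)
    for _ in range(NUM_CODEBOOKS - 1):             # codebooks 1..31 from the
        step = small(state)[:, -1]                 # lightweight decoder
        codes.append(head_rest(step).argmax(-1))
        state = torch.cat([state, code_emb(codes[-1]).unsqueeze(1)], dim=1)
    return [int(c) for c in codes]

context = torch.randn(1, 10, D_MAIN)               # pretend prefix of frame embeddings
print(generate_frame(context))
```

The pay-off of the split is that the large backbone only runs 12.5 times per second, while the cheap decoder absorbs the remaining 31 token predictions per frame.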

Orpheus

* Built on Llama 3B

* Single codebook shared across hierarchical layers

* Single transformer generates all tokens (flattening sketched after this list)

* Uses convolutional layers for hierarchical encoding

* Fine-tunable for voice cloning
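
By contrast, keeping a single transformer for everything implies flattening each frame’s hierarchical codes into one long token stream, then unflattening before audio decoding. A rough sketch of that flattening, with a placeholder layer count and codebook size (not Orpheus’s real configuration):

```python
# Rough sketch: flatten hierarchical codec tokens into one stream so a single
# transformer generates every audio token (layer count/codebook size are placeholders).
NUM_LAYERS = 4          # hierarchical layers per audio frame (placeholder value)
CODEBOOK_SIZE = 4096    # placeholder codebook size

def flatten_frames(frames: list[list[int]]) -> list[int]:
    """[[layer0, layer1, ...], ...] -> one flat token stream with per-layer offsets."""
    flat = []
    for frame in frames:
        for layer, code in enumerate(frame):
            # Offset each layer so the transformer can tell layers apart even if
            # they share one underlying codebook (as in the bullet above).
            flat.append(layer * CODEBOOK_SIZE + code)
    return flat

def unflatten(flat: list[int]) -> list[list[int]]:
    """Invert the flattening before handing codes back to the audio decoder."""
    frames = [flat[i:i + NUM_LAYERS] for i in range(0, len(flat), NUM_LAYERS)]
    return [[token % CODEBOOK_SIZE for token in frame] for frame in frames]

frames = [[12, 901, 44, 7], [13, 880, 50, 9]]    # two fabricated audio frames
flat = flatten_frames(frames)
assert unflatten(flat) == frames
print(flat)
```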

Moshi

* Custom 7B transformer backbone

* 8 hierarchical codebook layers

* Two-stage architecture:

  * Main transformer predicts embeddings

  * Decoder converts to 8 tokens per timestep

* Supports real-time conversation (per-timestep layout sketched below)
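
Because Moshi targets real-time conversation, each timestep carries more than one stream: the Moshi paper interleaves a text stream with audio codebooks for both speakers. The layout below is purely illustrative (only the 8-codebook count comes from the description above; the ordering is a guess).

```python
# Purely illustrative layout of one Moshi timestep (ordering is a guess; only the
# 8-codebook count comes from the description above).
from dataclasses import dataclass

NUM_CODEBOOKS = 8   # hierarchical codebook layers per audio stream

@dataclass
class Timestep:
    text_token: int          # one token of Moshi's text ("inner monologue") stream
    moshi_audio: list[int]   # 8 codec tokens for Moshi's own speech
    user_audio: list[int]    # 8 codec tokens for the user's speech

    def as_sequence(self) -> list[int]:
        """Flatten one timestep in the order a small decoder could fill it in."""
        assert len(self.moshi_audio) == len(self.user_audio) == NUM_CODEBOOKS
        return [self.text_token, *self.moshi_audio, *self.user_audio]

step = Timestep(text_token=42,
                moshi_audio=[3, 77, 901, 12, 5, 640, 88, 2],
                user_audio=[9, 15, 300, 41, 7, 512, 66, 1])
print(step.as_sequence())   # 17 tokens per timestep in this toy layout
```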

Performance Considerations

The models face key technical challenges:

* Need to generate 100+ tokens/second for real-time speech (see arithmetic after this list)

* Memory constraints limit model size for real-time use

* Trade-off between audio quality and generation speed

* Hierarchical approaches help balance quality vs speed
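
To put numbers on the real-time constraint, here is the back-of-envelope arithmetic implied by the figures quoted above (12.5 Hz frames; 8 codebooks for Moshi, 32 for CSM-1B):

```python
# Back-of-envelope token throughput for real-time audio generation.
FRAME_RATE_HZ = 12.5   # audio frames per second, as quoted for CSM-1B above

for name, codebooks in [("Moshi", 8), ("CSM-1B", 32)]:
    tokens_per_sec = FRAME_RATE_HZ * codebooks
    print(f"{name}: {FRAME_RATE_HZ} frames/s x {codebooks} codebooks "
          f"= {tokens_per_sec:.0f} audio tokens/s")

# Moshi:  12.5 x 8  = 100 audio tokens/s -> the "100+ tokens/second" figure above.
# CSM-1B: 12.5 x 32 = 400 audio tokens/s, which is why only the first codebook
# goes through the large backbone and a smaller decoder handles the rest.
```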

Practical Applications

The models enable:

* Text-to-speech synthesis

* Voice cloning with few samples

* Multi-turn conversations

* Emotion preservation

* Speaker style transfer

The technology represents a shift toward unified multimodal architectures, though real-time performance remains an active area of development (especially in open source).


