This academic paper introduces "Just image Transformers" (JiT), a novel approach to denoising diffusion models that advocates directly predicting clean data (**x-prediction**) rather than predicting noise or a noised quantity. The authors argue this shift is critical based on the **manifold assumption**, which posits that clean data lies on a low-dimensional manifold while noise is inherently off-manifold. Experiments, including a toy model and high-resolution ImageNet generation using plain Vision Transformers (ViT), demonstrate that x-prediction succeeds in high-dimensional spaces where conventional noise-predicting methods catastrophically fail. The work emphasizes a return to first principles: a self-contained **"Diffusion + Transformer"** paradigm on raw pixel data, without complex architectures, pre-training, or auxiliary losses. Finally, the paper provides extensive ablation studies on loss combinations and architectural components to validate that **x-prediction** is fundamentally more tractable for limited-capacity networks in high-dimensional generative modeling.
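To make the x-prediction vs. noise-prediction distinction concrete, here is a minimal NumPy sketch under an assumed linear interpolation schedule x_t = (1 - t)·x + t·ε (the paper's exact schedule and training setup may differ; function names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: linearly interpolate between clean data x and noise eps.
def add_noise(x, eps, t):
    return (1.0 - t) * x + t * eps

# x-prediction objective: regress the clean sample itself.
def x_pred_loss(x_hat, x):
    return np.mean((x_hat - x) ** 2)

# noise-prediction objective: regress the added noise.
def eps_pred_loss(eps_hat, eps):
    return np.mean((eps_hat - eps) ** 2)

# The two parameterizations are algebraically interchangeable:
# from x_t = (1 - t) * x + t * eps  =>  eps = (x_t - (1 - t) * x) / t
def implied_eps(x_t, x_hat, t):
    return (x_t - (1.0 - t) * x_hat) / t

x = rng.standard_normal(16)    # stand-in for a flattened clean image
eps = rng.standard_normal(16)  # Gaussian noise
t = 0.7
x_t = add_noise(x, eps, t)

# A perfect x-prediction implies the exact noise that was added.
assert np.allclose(implied_eps(x_t, x, t), eps)
```

Although the two targets are interchangeable in closed form, the paper's argument is that a limited-capacity network finds the on-manifold target (x) far easier to represent than the off-manifold one (ε), and the gap widens with dimensionality.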
Info
- Program
- Frequency: weekly updates
- Published: November 23, 2025, 4:45 AM UTC
- Length: 15 minutes
- Rating: all ages
