Back to Basics: Let Denoising Generative Models Denoise
Best AI papers explained - A podcast by Enoch H. Kang
This academic paper introduces "Just image Transformers" (JiT), a novel approach to denoising diffusion models that advocates directly predicting the clean data (**x-prediction**) rather than predicting the noise or another noised quantity. The authors argue this shift is critical based on the **manifold assumption**, which posits that clean data lies on a low-dimensional manifold while noise is inherently off-manifold. Experiments, including a toy model and high-resolution ImageNet generation with plain Vision Transformers (ViT), demonstrate that x-prediction succeeds in high-dimensional spaces where conventional noise-predicting methods catastrophically fail. The work emphasizes a return to first principles: a self-contained **"Diffusion + Transformer"** paradigm on raw pixel data, without complex architectures, pre-training, or auxiliary losses. Finally, the paper provides extensive ablation studies on loss combinations and architectural components to validate that **x-prediction** is fundamentally more tractable for limited-capacity networks in high-dimensional generative modeling.
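To make the contrast between the two regression targets concrete, below is a minimal sketch (not the authors' code) of a single denoising training step. The network `net`, its width, and the linear flow-style interpolation `x_noisy = (1 - t) * x_clean + t * noise` are illustrative assumptions; only the choice of regression target (clean data vs. noise) reflects the paper's central claim.

```python
# Minimal sketch contrasting x-prediction and noise-prediction losses in a
# denoising diffusion step. Architecture and noising schedule are assumptions,
# not the paper's exact setup.
import torch
import torch.nn as nn

dim = 3072  # e.g. flattened 32x32x3 raw pixels; high-dimensional data
net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

def training_loss(x_clean: torch.Tensor, predict_x: bool) -> torch.Tensor:
    """One diffusion training step on a batch of clean samples."""
    noise = torch.randn_like(x_clean)           # Gaussian noise: off-manifold
    t = torch.rand(x_clean.shape[0], 1)         # random time in [0, 1]
    x_noisy = (1 - t) * x_clean + t * noise     # interpolate clean -> noise
    out = net(torch.cat([x_noisy, t], dim=1))   # network sees noisy input + t
    if predict_x:
        # x-prediction: regress the clean sample, which lies on a
        # low-dimensional manifold -- tractable for a small network.
        return ((out - x_clean) ** 2).mean()
    # noise-prediction: regress the full-dimensional off-manifold noise,
    # which the paper argues becomes intractable as dimension grows.
    return ((out - noise) ** 2).mean()

x = torch.randn(8, dim)  # stand-in batch of "clean" data
print(training_loss(x, predict_x=True).item(),
      training_loss(x, predict_x=False).item())
```

Both variants use the same network and noising process; the only difference is whether the target lives on the data manifold or fills the whole ambient space, which is what the paper's toy experiments and ImageNet results probe.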
