In this episode, we discuss Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model by Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy. The paper introduces Transfusion, a method for training a single multi-modal model by combining a language-modeling objective with a diffusion objective over mixed-modality sequences. Transfusion models with up to 7B parameters scale better and outperform approaches that quantize images into discrete tokens on both uni-modal and cross-modal benchmarks. Additionally, modality-specific encoding and decoding layers yield further gains, enabling a single model to generate high-quality images and text.
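To make the combined objective concrete, here is a minimal sketch of how a language-modeling loss on text tokens and a diffusion (noise-prediction) loss on image patches could be summed during training. The function name, tensor shapes, and the weighting coefficient `lam` are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, noise_pred, noise, lam=1.0):
    """Sketch of a Transfusion-style combined loss (assumed names/shapes).

    text_logits:  (batch, text_len, vocab) next-token predictions
    text_targets: (batch, text_len) ground-truth token ids
    noise_pred:   predicted diffusion noise for image patches
    noise:        true noise added to the image patches
    lam:          illustrative weight balancing the two objectives
    """
    # Next-token cross-entropy over text positions only
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Diffusion loss: MSE between predicted and true noise on image patches
    diffusion_loss = F.mse_loss(noise_pred, noise)
    return lm_loss + lam * diffusion_loss
```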