# Abstract
This tutorial aims to disseminate and promote recent research advancements in multi-modal generative AI, focusing on two dominant families of techniques: multi-modal large language models (MLLMs) for understanding and diffusion models for visual generation. We will provide a systematic discussion of MLLMs and multi-modal diffusion models, covering their probabilistic modeling methods, architectures, and multi-modal interaction mechanisms.
In dynamic and open environments, shifting data distributions, emerging concepts, and evolving complex application scenarios create significant obstacles for multi-modal generative models. This tutorial explores solutions and future directions to address these challenges from two aspects: one is generalizable post-training techniques to adapt multi-modal generative models to new concepts, and the other is the development of a unified multi-modal generation and understanding framework for complex multi-modal tasks.
Keywords: Multi-Modal Generative AI; Dynamic and Open Environment
# Description
This tutorial covers recent advancements in multi-modal generative AI, including multi-modal large language models (MLLMs) and diffusion models, and highlights the challenges that arise when applying them to dynamic and open environments. To address these challenges, we will discuss generalizable post-training techniques and unified multi-modal understanding and generation frameworks.
# Tutorial Length
Our proposed length of the tutorial is 1/4 day.
# Outline
# Introduction
Multi-modal generative AI has received increasing attention in both academia and industry. In particular, two dominant families of models are: i) multi-modal large language models (MLLMs) such as GPT-4V, which show impressive abilities in multi-modal understanding; ii) diffusion models such as Sora, which exhibit remarkable capabilities in visual generation. Through pretraining on large-scale multi-modal data, these models have demonstrated strong performance. However, the real world is a dynamic and open environment: data distribution shifts often occur, new concepts constantly emerge, and users' requirements become increasingly complex. This gives rise to two important research problems: (i) how to adapt pretrained multi-modal generative models to new concepts in dynamic environments through generalizable post-training techniques, and (ii) how to follow users' complex instructions by building a unified model that simultaneously supports multi-modal generation and understanding.
To elaborate on these problems, we organize our tutorial as follows: (i) a detailed analysis of existing MLLM works; (ii) a detailed review of existing diffusion models, covering their probabilistic modeling procedures, multi-modal architecture designs, and advanced applications such as image/video large language models and text-to-image/video generation; (iii) generalizable post-training techniques that adapt multi-modal generative AI to new concepts; and (iv) the development of a unified multi-modal understanding and generation framework, including probabilistic modeling, tokenization methods, and model architectures.

# Multi-Modal LLM
Multi-modal large language models have recently become dominant in the field of multi-modal understanding. In this section, we will review the literature on multi-modal large language models. Before discussing specific MLLM works, we will first present preliminaries on LLM auto-regressive modeling, vision-language pretraining, and visual tokenizers. We then categorize existing MLLM architectures into two branches, early-fusion architectures and alignment architectures, and analyze the potential advantages and disadvantages of each. Furthermore, we shed light on existing image LLMs and video LLMs and point out the open challenges in these applications.
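As a reference point for the auto-regressive preliminary, the standard next-token factorization that MLLMs build upon is summarized below; the notation is illustrative, with $x_t$ denoting the $t$-th token in an interleaved multi-modal token sequence after tokenization.

```latex
% Auto-regressive factorization and training loss over a multi-modal
% token sequence x_1, ..., x_T (text and visual tokens interleaved):
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}_{\text{AR}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}).
```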

# Diffusion Models
Diffusion models have become dominant in the field of visual generation. In this section, we will begin with preliminaries on diffusion models such as DDPM and compare them with traditional generative models such as VAEs and GANs. We then focus on two important techniques: latent diffusion models and diffusion transformers. Turning to text-to-image and text-to-video applications, we discuss several well-known works such as DALL-E and Stable Diffusion 3, elaborating on how they model multi-modal interactions in the diffusion model and how diffusion models can be made more controllable with conditions from different modalities.
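For reference, the DDPM preliminary rests on the standard forward noising process and the simplified denoising objective, written here in standard DDPM notation independent of any particular implementation:

```latex
% Forward (noising) process and its closed-form marginal
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right),
\quad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s).

% Simplified training objective: predict the injected noise
\mathcal{L}_{\text{simple}}(\theta)
= \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}
\left[ \big\| \epsilon - \epsilon_\theta\!\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t\big) \big\|^2 \right].
```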

# Generalizable Post-training to New Concepts
A key challenge in multi-modal generative AI is adaptation to new concepts, where users introduce new demands, application scenarios evolve, and data distributions shift, commonly known as out-of-distribution (OOD) or non-IID issues. Pretrained multi-modal foundation models are inherently static, limiting their ability to handle new concepts, emerging subjects, and dynamically changing scenes. In real-world settings, multiple new subjects interact through diverse actions, while their environments continuously evolve. These dynamic challenges raise a critical research question: how can we effectively model and control multiple new subjects, actions, and dynamic scenes? In this section, we discuss generalizable solutions to these challenges, including disentangled finetuning and curriculum multi-reward reinforcement finetuning; a schematic sketch of the underlying parameter-efficient post-training pattern follows.
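To make the post-training discussion concrete, below is a minimal PyTorch-style sketch, assuming a frozen pretrained backbone and a small trainable low-rank adapter learned on new-concept data. The names (`LowRankAdapter`, `backbone_layer`, the toy data and loss) are hypothetical placeholders that illustrate the general pattern only, not the specific disentangled or reinforcement finetuning methods presented in the tutorial.

```python
# Hypothetical sketch: parameter-efficient post-training for a new concept.
# The pretrained backbone stays frozen; only a small low-rank adapter is trained,
# so new-concept knowledge is added without overwriting pretrained knowledge.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Adds a trainable low-rank residual to a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze pretrained weights
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)     # start as an identity-preserving residual

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

# Toy stand-in for one pretrained layer of a multi-modal backbone.
backbone_layer = nn.Linear(64, 64)
adapted_layer = LowRankAdapter(backbone_layer, rank=4)

# Only adapter parameters are optimized during post-training.
optimizer = torch.optim.AdamW(
    [p for p in adapted_layer.parameters() if p.requires_grad], lr=1e-4
)

for step in range(10):                     # placeholder post-training loop
    x = torch.randn(8, 64)                 # stands in for new-concept features
    target = torch.randn(8, 64)            # stands in for the supervision signal
    loss = nn.functional.mse_loss(adapted_layer(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Disentangled finetuning would additionally separate identity-related from identity-irrelevant components, and reinforcement finetuning would replace the toy loss with reward signals; the frozen-backbone-plus-adapter pattern above only captures the common starting point.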
# Unified Multi-Modal Understanding and Generation
Having discussed the MLLM and diffusion literature, in this section we focus on the unified multi-modal understanding and generation framework. The challenge in constructing a unified framework is how to serve the two objectives, multi-modal understanding and visual generation, simultaneously and without conflict. We discuss this trending problem from three perspectives: (i) the probabilistic modeling procedure, (ii) the tokenization methods, and (iii) the model architecture. For probabilistic modeling, we cover pure auto-regressive frameworks and mixed auto-regressive and diffusion frameworks. For tokenization, we discuss discrete tokens, continuous tokens, pixel tokens, and semantic tokens. For model architecture, we discuss which models should handle the different input modalities and how the large model should be constructed, as a dense model or a Mixture of Experts (MoE).
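As one illustrative formulation of the mixed auto-regressive and diffusion modeling mentioned above (not tied to any specific published system), a unified model can jointly optimize an auto-regressive loss over text tokens and a diffusion loss over visual latents:

```latex
% Illustrative mixed objective: auto-regressive understanding + diffusion generation
\mathcal{L}(\theta)
= \underbrace{-\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t},\, z\right)}_{\text{auto-regressive loss over text tokens } y}
\; + \;
\lambda \,
\underbrace{\mathbb{E}_{\epsilon,\, s}\!\left[\big\| \epsilon - \epsilon_\theta\!\left(z_s,\, s,\, c\right) \big\|^2\right]}_{\text{diffusion loss over visual latents } z}
```

Here $y$, $z$, the conditioning context $c$, the diffusion timestep $s$, and the balancing weight $\lambda$ are illustrative symbols; concrete systems differ in how the two terms share parameters, tokenizers, and conditioning.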

# Future Directions
We discuss future directions from the following perspectives:
- Unified GenAI for Video Generation and Understanding. Existing attempts at unified multi-modal GenAI are largely limited to the image-text domain; extending them to the video domain is a natural and necessary next step.
- Unified Generation and Understanding Benchmark. Despite pioneering work on unified generation and understanding models, the two tasks are still evaluated separately in a non-unified way. Designing a benchmark that fairly evaluates the effectiveness of different methods is an important open problem.
- Multi-Modal Graph GenAI. Existing multi-modal GenAI mainly focuses on alignment within each instance, such as an image and its caption, neglecting the relations among different instances. Incorporating graphs into multi-modal GenAI to capture such relations could be an interesting future direction.
- Lightweight Multi-Modal GenAI. To save computational resources and deploy multi-modal GenAI on more devices, developing lightweight multi-modal generative models is an important future direction.
- Embodied Multi-Modal GenAI. Existing multi-modal GenAI generally does not interact with the physical world. In the future, multi-modal generative AI should behave more like humans: perceiving the multi-modal environment, reasoning and planning based on its perception and state, taking actions, and improving itself.
# Q&A
This tutorial includes 15 minutes for questions and answers. We welcome any questions from the audience.
# Target Audience
Our target audience is the general AI community, especially researchers who are interested in generative AI, multi-modality, multi-modal large language models, and diffusion models.
# Tutorial Objectives
The tutorial focuses on the most recent advancements in multi-modal generative AI, such as GPT-4V and Sora, and the trending topic of unified multi-modal generation and understanding frameworks. The audience will learn about the probabilistic modeling methods of multi-modal generative AI, such as auto-regressive and diffusion modeling, the model architectures, such as early-fusion and alignment-based architectures, and advanced applications, such as image LLMs, video LLMs, text-to-image generation, and text-to-video generation.
# CV of the presenters
Wenwu Zhu is currently a Professor in the Department of Computer Science and Technology at Tsinghua University, the Vice Dean of the National Research Center for Information Science and Technology, and the Vice Director of the Tsinghua Center for Big Data. His Google Scholar page is https://scholar.google.com/citations?user=7t2jzpgAAAAJ. Prior to his current post, he was a Senior Researcher and Research Manager at Microsoft Research Asia. He was the Chief Scientist and Director at Intel Research China from 2004 to 2008. He worked at Bell Labs New Jersey as a Member of Technical Staff during 1996-1999. He received his Ph.D. degree from New York University in 1996. His research interests include graph machine learning, curriculum learning, data-driven multimedia, and big data. He has published over 400 refereed papers and is the inventor of over 80 patents. He has received ten Best Paper Awards, including ACM Multimedia 2012 and IEEE Transactions on Circuits and Systems for Video Technology in 2001 and 2019. He serves as Editor-in-Chief (EiC) of IEEE Transactions on Circuits and Systems for Video Technology, and served as EiC of IEEE Transactions on Multimedia (2017-2019) and Chair of the steering committee for IEEE Transactions on Multimedia (2020-2022). He served as General Co-Chair for ACM Multimedia 2018 and ACM CIKM 2019. He is an AAAS Fellow, IEEE Fellow, ACM Fellow, SPIE Fellow, and a member of Academia Europaea.
Xin Wang (http://mn.cs.tsinghua.edu.cn/xinwang/) is currently an Associate Professor at the Department of Computer Science and Technology, Tsinghua University. He received both his Ph.D. and B.E. degrees in Computer Science and Technology from Zhejiang University, China, and also holds a Ph.D. degree in Computing Science from Simon Fraser University, Canada. His research interests include multimedia intelligence, machine learning, and its applications in multimedia big data analysis. He has published over 150 high-quality research papers in top journals and conferences, including IEEE TPAMI, IEEE TKDE, ACM TOIS, ICML, NeurIPS, ACM KDD, ACM Web Conference, ACM SIGIR, and ACM Multimedia, winning three best paper awards. He is the recipient of the 2020 ACM China Rising Star Award, the 2022 IEEE TCMC Rising Star Award, and the 2023 DAMO Academy Young Fellow award.
Hong Chen is currently a Ph.D. student at the Department of Computer Science and Technology, Tsinghua University. He received his B.E. degree from the Department of Electronic Engineering, Tsinghua University. His main research interests include machine learning, curriculum learning, auxiliary learning, and multi-modal generative AI. He has published high-quality research papers in ICML, NeurIPS, IEEE TPAMI, ACM KDD, WWW, ACM Multimedia, etc.
Yuwei Zhou is currently a Ph.D. student at the Department of Computer Science and Technology, Tsinghua University. He received his B.E. degree from the Department of Computer Science and Technology, Tsinghua University. His main research interests include machine learning, curriculum learning, and multi-modal generative AI.
# Research Achievements
As the presenters of this tutorial, both Wenwu Zhu and Xin Wang have been deeply involved in the relevant research, with a considerable number of recent publications, including a survey paper on multi-modal generative AI (https://arxiv.org/abs/2409.14993) submitted to IEEE TMM (1) and technical papers in top-tier conferences and journals covering both multi-modal understanding (2-10) and multi-modal generation (11-17).
- Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond. Submitted to IEEE TMM.
- Identity-Text Video Corpus Grounding. In AAAI, 2025.
- VTimeLLM: Empower LLM to Grasp Video Moments. In CVPR, 2024.
- Neighbor Does Matter: Curriculum Global Positive-Negative Sampling for Vision-Language Pre-training. In ACM Multimedia, 2024.
- Large Language Model With Curriculum Reasoning for Visual Concept Recognition. In ACM KDD, 2024.
- LLM4DyG: Can LLMs Solve Spatial-Temporal Problems on Dynamic Graphs? In ACM KDD, 2024.
- RealTCD: Temporal Causal Discovery from Interventional Data with Large Language Model. In ACM CIKM, 2024.
- Automated Disentangled Sequential Recommendation with Large Language Models. In ACM TOIS, 2024.
- Dynamic Spatio-Temporal Graph Reasoning for VideoQA with Self-Supervised Event Recognition. In IEEE TIP, 2024.
- VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding. In NeurIPS, 2024.
- Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM. In AAAI, 2025.
- DisenBooth: Identity-Preserving Disentangled Tuning for Subject-Driven Text-to-Image Generation. In ICLR, 2024.
- DisenStudio: Customized Multi-Subject Text-to-Video Generation with Disentangled Spatial Control. In ACM Multimedia, 2024.
- Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models. In ECCV, 2024.
- DisenDreamer: Subject-Driven Text-to-Image Generation with Sample-aware Disentangled Tuning. In IEEE TCSVT, 2024.
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models. In IEEE TMM, 2024.
- ScenarioDiff: Text-to-video Generation with Dynamic Transformations of Scene Conditions. In IJCV, 2024.
# Previous Tutorials
The presenters have given 19 tutorials on machine learning-related topics over the past five years, with the following titles and venues.
- "Graph Machine Learning under Distribution Shifts: Adaptation, Generalization and Extension to LLM" in ACM Web Conference 2025.
- "Curriculum Learning in the Era of Large Language Models" in AAAI 2025.
- "Graph Machine Learning under Distribution Shifts: Adaptation, Generalization and Extension to LLM" in AAAI 2025.
- "Curriculum Learning for Multimedia in the Era of LLM" in ACM Multimedia 2024.
- "Curriculum Learning: Theories, Approaches, Applications, Tools, and Future Directions in the Era of Large Language Models" in IJCAI 2024.
- "Graph Machine Learning under Distribution Shifts: Adaptation, Generalization and Extension to LLM" in IJCAI 2024.
- "Curriculum Learning: Theories, Approaches, Applications, Tools, and Future Directions in the Era of Large Language Models" in ACM Web Conference 2024.
- "Disentangled Representation Learning" in AAAI 2024.
- "Towards Out-of-Distribution Generalization on Graphs" in AAAI 2024.
- "Curriculum Learning: Theories, Approaches, Applications and Tools" in AAAI 2024.
- "Disentangled Representation Learning for Multimedia" in ACM Multimedia 2023.
- "Towards Out-of-Distribution Generalization on Graphs" in IJCAI 2023.
- "Towards Out-of-Distribution Generalization on Graphs" in ACM Web Conference 2023.
- "Video Grounding and Its Generalization" in ACM Multimedia 2022.
- "Disentangled Representation Learning: Approaches and Applications" in IJCAI 2022.
- "Out-of-Distribution Generalization and Its Applications for Multimedia" in ACM Multimedia 2021.
- "Automated Machine Learning on Graph" in ACM SIGKDD 2021.
- "Meta-learning and AutoML: Approaches and Applications" in IJCAI 2020.
- "Multimedia Intelligence: When Multimedia Meets Artificial Intelligence" in ACM Multimedia 2020.
# Teaching
Xin Wang and Wenwu Zhu are faculty members at the Department of Computer Science and Technology, Tsinghua University. They have rich experience in teaching both undergraduate- and graduate-level courses.