Tutorial Description
This tutorial aims to disseminate and promote recent research advancements in multi-modal generative AI, centering on two dominant families of techniques: multi-modal large language models (MLLMs) for understanding and diffusion-based models for visual generation. We will provide a systematic overview of these approaches, covering their probabilistic foundations, architectural designs, and mechanisms for cross-modal interaction, offering participants a comprehensive understanding of how modern multi-modal systems integrate perception, reasoning, and generation.
Beyond core modeling techniques, we formalize out-of-distribution (OOD) environments in post-training as scenarios where the inference-time data distribution differs from that used during model adaptation, resulting in a mismatch between training-time optimality and deployment-time requirements. Such conditions arise from distribution shifts, emerging concepts, and increasingly complex application settings, posing fundamental challenges to generalization. To address these issues, the tutorial explores recent solutions and future directions from two complementary perspectives: generalizable post-training techniques for adapting to OOD environments, and unified multi-modal frameworks that support both understanding and generation for complex tasks in dynamic, open-world scenarios.
Tutorial Outline
The tutorial is scheduled for a quarter day (1 hour and 45 minutes) and is organized into 5 sections.
Target Audience and Prerequisites
This tutorial is accessible to the whole machine learning community, including researchers, scholars, engineers, and students with backgrounds in computer vision (CV), natural language processing (NLP), large language models (LLMs), and artificial intelligence generated content (AIGC). It is self-contained and designed for introductory and intermediate audiences; no special prerequisite knowledge is required.
Tutorial Objective
Multi-modal generative AI has rapidly become a central paradigm in AI research, driven by advances in multi-modal large language models and diffusion-based generation, with broad impact on vision–language reasoning, content creation, and interactive systems. This tutorial will interest a substantial portion of the IJCAI audience by providing a systematic survey of foundational methodologies and emerging challenges in out-of-distribution generalization, while synthesizing post-training adaptation strategies and unified understanding–generation frameworks. It best serves the objectives of introducing major and emerging topics to novices and expert non-specialists, surveying a fast-growing area of AI research, and presenting a novel synthesis that bridges distinct lines of work in multi-modal understanding and generation.
Tutorial Overview
Introduction
Multi-modal generative AI has attracted increasing attention in both academia and industry, with two dominant model families: multi-modal large language models (MLLMs) for multi-modal understanding and diffusion-based models for multi-modal generation. Despite strong performance enabled by large-scale pretraining, real-world deployment occurs in open and evolving environments, where the distributional assumptions made during training are often violated.
We formalize the out-of-distribution (OOD) environment in post-training as follows. Let \( f_{\theta}(y \mid x) \) denote a conditional generative model, and let \( P(x,y) \) be the joint data distribution used for post-training. The model parameters \( \theta_P^{*} \) are obtained by minimizing the expected training loss under \( P \).
At inference time, data are drawn from a different distribution \( Q(x,y) \). We define the OOD environment by the resulting mismatch between \( \theta_P^{*} \), which is optimal under \( P \), and the parameters that would be optimal under \( Q \). This mismatch can arise whenever \( P(x,y)\neq Q(x,y) \). In practice, the discrepancy may stem from shifts in the input distribution, changes in the conditional generative relationship \( P(y\mid x)\neq Q(y\mid x) \), or the emergence of novel concepts outside the support of the post-training data. Such mismatches cause post-trained models to over-specialize to \( P \), leading to degraded generalization at inference.
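The objective and the mismatch described above can be written out as follows. This is a sketch under stated assumptions: the per-example loss \( \ell \) and the deployment-optimal parameters \( \theta_Q^{*} \) are notation introduced here for illustration and are not taken from the original text.

```latex
% Assumed notation: \ell is a generic per-example loss; \theta_Q^{*} is the minimizer under Q.
\[
\theta_P^{*} = \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim P}\big[\ell\big(f_{\theta}(y \mid x)\big)\big],
\qquad
\theta_Q^{*} = \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim Q}\big[\ell\big(f_{\theta}(y \mid x)\big)\big],
\]
\[
\theta_P^{*} \neq \theta_Q^{*} \quad \text{in general when } P(x,y) \neq Q(x,y).
\]
```

For a likelihood-based model, one common choice is \( \ell = -\log f_{\theta}(y \mid x) \). Under this reading, the cost of the mismatch is the excess risk \( \mathbb{E}_{Q}[\ell(f_{\theta_P^{*}})] - \mathbb{E}_{Q}[\ell(f_{\theta_Q^{*}})] \geq 0 \).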
Accordingly, this tutorial focuses on:
- Detailed analysis of existing MLLM and diffusion methods.
- Generalizable post-training techniques for adapting multi-modal generative models to OOD environments.
- Unified frameworks that support both multi-modal understanding and generation.
Multi-Modal LLM
Multi-modal large language models (MLLMs) have recently become dominant in multi-modal understanding. In this section, we review the MLLM literature.
Before discussing specific MLLM works, we first present preliminaries on LLM auto-regressive modeling, vision-language pretraining, and visual tokenizers. We then categorize existing MLLM architectures into two branches, early-fusion architectures and alignment architectures, and analyze their respective advantages and disadvantages. Finally, we examine existing image LLMs and video LLMs and discuss open challenges.
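As a rough illustration of the alignment branch, the sketch below assumes one common reading of it: features from a separate vision encoder are mapped by a small trainable projector into the LLM embedding space and prepended to the text token embeddings. All sizes, function names, and the random stand-in features are hypothetical and chosen only for illustration; no specific model is being reproduced.

```python
import numpy as np

def project_visual_features(patch_features, weight, bias):
    """Linearly map vision-encoder patch features into the LLM embedding space."""
    return patch_features @ weight + bias

def build_llm_input(visual_tokens, text_embeddings):
    """Prepend projected visual tokens to the text token embeddings."""
    return np.concatenate([visual_tokens, text_embeddings], axis=0)

rng = np.random.default_rng(0)
# Hypothetical sizes, for illustration only.
num_patches, vision_dim, llm_dim, num_text_tokens = 16, 32, 64, 5

# Random stand-ins for a vision encoder output and LLM token embeddings.
patch_features = rng.normal(size=(num_patches, vision_dim))
text_embeddings = rng.normal(size=(num_text_tokens, llm_dim))

# Projector parameters (the part that would be trained for alignment).
weight = rng.normal(size=(vision_dim, llm_dim)) * 0.02
bias = np.zeros(llm_dim)

visual_tokens = project_visual_features(patch_features, weight, bias)
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(llm_input.shape)  # (21, 64)
```

An early-fusion design, by contrast, would feed visual and text tokens into a single transformer without a separately pretrained vision encoder in front of it.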
Diffusion Models
Diffusion models have become dominant in the field of visual generation. We begin with preliminaries such as denoising diffusion probabilistic models (DDPM) and compare diffusion models with traditional generative models such as VAEs and GANs.
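To make the DDPM preliminaries concrete, the following minimal sketch samples from the closed-form forward (noising) process \( q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I) \). The linear schedule values, the function names, and the toy 8x8 input are assumptions made for illustration; the reverse (denoising) network is not shown.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) in one step."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy "image" scaled to [-1, 1]

# At the final step almost no signal remains, so x_t is close to pure noise.
x_t, noise = q_sample(x0, t=999, alpha_bar=alpha_bar, rng=rng)
```

In DDPM training, a network is then fit to predict `noise` from `x_t` and `t`, which is what allows the process to be reversed at sampling time.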
We then focus on two important techniques: Latent Diffusion Models and Diffusion Transformers. In text-to-image and text-to-video applications, we discuss several influential works such as DALL-E and Stable Diffusion 3. We elaborate on how they model multi-modal interactions and how these models can be made more controllable with different conditioning modalities.
Generalizable Post-training to New Concepts
A key challenge in multi-modal generative AI is adaptation to new concepts, where users introduce new demands, application scenarios evolve, and data distributions shift, commonly known as OOD or non-IID issues. Pretrained multi-modal foundation models are inherently static, which limits their ability to handle new concepts, emerging subjects, and dynamically changing scenes. In real-world settings, multiple new subjects interact through diverse actions while their environments continuously evolve. These dynamics raise a critical research question: how can we effectively model and control multiple new subjects, actions, and dynamic scenes? In this section, we discuss generalizable solutions based on disentangled finetuning and curriculum multi-reward reinforcement finetuning.
Unified Multi-Modal Understanding and Generation
After discussing the MLLM and diffusion literature, we focus on unified multi-modal understanding and generation architectures. The central challenge is how to simultaneously support multi-modal understanding and visual generation. We discuss this problem from three perspectives: (i) the probabilistic modeling procedure, (ii) tokenization methods, and (iii) model architecture. From the probabilistic modeling perspective, we present both pure auto-regressive frameworks and mixed auto-regressive and diffusion frameworks. From the tokenization perspective, we discuss discrete tokens, continuous tokens, pixel tokens, and semantic tokens. From the model architecture perspective, we discuss how different modalities should be handled and whether large models should adopt dense architectures or Mixture-of-Experts designs.
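The distinction between discrete and continuous tokens can be illustrated with nearest-neighbor vector quantization, one common way to turn continuous visual features into discrete token ids. This is a minimal sketch: the tiny hand-written codebook, the feature values, and the function name are hypothetical and are not taken from any specific tokenizer discussed in the tutorial.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous feature to its nearest codebook entry.

    Returns the discrete token ids and the corresponding codebook vectors.
    """
    # Squared Euclidean distance between every feature and every code.
    distances = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    token_ids = distances.argmin(axis=1)
    return token_ids, codebook[token_ids]

# Hypothetical 4-entry codebook in a 2-dimensional feature space.
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Continuous tokens, e.g. outputs of a visual encoder (toy values).
continuous_tokens = np.array([[0.1, 0.9], [0.9, 0.2], [0.1, 0.9]])

token_ids, quantized = quantize(continuous_tokens, codebook)
print(token_ids.tolist())  # [1, 2, 1]
```

Discrete ids fit naturally into a pure auto-regressive next-token objective, while continuous tokens avoid the quantization error visible in `quantized - continuous_tokens` and are often paired with a diffusion-style objective.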
Future Directions
- Unified GenAI for Video Generation and Understanding. Existing unified multi-modal GenAI efforts are mostly confined to image-text settings, making extension to video both natural and necessary.
- Unified Generation and Understanding Benchmark. Existing evaluations of unified generation and understanding are still largely separated. Designing fair and unified benchmarks remains an important open problem.
- Multi-Modal Graph GenAI. Existing multi-modal GenAI mainly focuses on alignment within each instance, such as an image and its caption, while neglecting relations among instances. Integrating graph structures into multi-modal GenAI could become an important direction.
- Lightweight Multi-Modal GenAI. Reducing computational cost and enabling deployment on more devices remains a crucial future direction.
- Embodied Multi-Modal GenAI. Existing systems generally do not interact with the physical world. Future multi-modal generative AI systems should perceive environments, reason and plan, take actions, and continuously improve through interaction.