Out-of-distribution Generalized Generative AI

IJCAI-ECAI 2026, Bremen, Germany

Speakers

Xin Wang, Tsinghua University, China

Xin Wang is currently an Associate Professor at the Department of Computer Science and Technology, Tsinghua University. He received both his Ph.D. and B.E. degrees in Computer Science and Technology from Zhejiang University, China, and also holds a Ph.D. degree in Computing Science from Simon Fraser University, Canada. His research interests include multimedia intelligence and machine learning, together with its applications in multimedia big data analysis. He has published over 150 high-quality research papers in top journals and conferences, including IEEE TPAMI, IEEE TKDE, ACM TOIS, ICML, NeurIPS, ACM KDD, ACM Web Conference, ACM SIGIR, and ACM Multimedia, winning three best paper awards. He is a recipient of the 2020 ACM China Rising Star Award and the 2022 IEEE TCMC Rising Star Award, and was named a 2023 DAMO Academy Young Fellow.

Zirui Pan, Tsinghua University, China

Zirui Pan is currently a Ph.D. student at the Department of Computer Science and Technology, Tsinghua University. He received his B.E. degree from the Department of Computer Science and Technology, Tsinghua University. His main research interests include curriculum learning, disentangled representation learning, and multi-modal generative AI.

Yuwei Zhou, Tsinghua University, China

Yuwei Zhou is currently a Ph.D. student at the Department of Computer Science and Technology, Tsinghua University. He received his B.E. degree from the Department of Computer Science and Technology, Tsinghua University. His main research interests include machine learning, curriculum learning, and multi-modal generative AI.

Wenwu Zhu, Tsinghua University, China

Wenwu Zhu is currently a Professor in the Department of Computer Science and Technology at Tsinghua University, the Vice Dean of the National Research Center for Information Science and Technology, and the Vice Director of the Tsinghua Center for Big Data. Prior to his current post, he was a Senior Researcher and Research Manager at Microsoft Research Asia. He was the Chief Scientist and Director at Intel Research China from 2004 to 2008. He worked at Bell Labs, New Jersey, as a Member of Technical Staff from 1996 to 1999. He received his Ph.D. degree from New York University in 1996.

His research interests include graph machine learning, curriculum learning, data-driven multimedia, and big data. He has published over 400 refereed papers and is the inventor of over 80 patents. He has received ten Best Paper Awards, including ACM Multimedia 2012 and IEEE Transactions on Circuits and Systems for Video Technology in 2001 and 2019.

He serves as the Editor-in-Chief (EiC) of IEEE Transactions on Circuits and Systems for Video Technology, and served as the EiC of IEEE Transactions on Multimedia (2017-2019) and the Chair of the steering committee for IEEE Transactions on Multimedia (2020-2022). He served as General Co-Chair for ACM Multimedia 2018 and ACM CIKM 2019. He is an AAAS Fellow, IEEE Fellow, ACM Fellow, SPIE Fellow, and a member of Academia Europaea.

Tutorial Description

This tutorial aims to disseminate and promote recent research advancements in multi-modal generative AI, centering on two dominant families of techniques: multi-modal large language models (MLLMs) for understanding and diffusion-based models for visual generation. We will provide a systematic overview of these approaches, covering their probabilistic foundations, architectural designs, and mechanisms for cross-modal interaction, offering participants a comprehensive understanding of how modern multi-modal systems integrate perception, reasoning, and generation.

Beyond core modeling techniques, we formalize out-of-distribution (OOD) environments in post-training as scenarios where the inference-time data distribution differs from that used during model adaptation, resulting in a mismatch between training-time optimality and deployment-time requirements. Such conditions arise from distribution shifts, emerging concepts, and increasingly complex application settings, posing fundamental challenges to generalization. To address these issues, the tutorial explores recent solutions and future directions from two complementary perspectives: generalizable post-training techniques for adapting to OOD environments, and unified multi-modal frameworks that support both understanding and generation for complex tasks in dynamic, open-world scenarios.


Tutorial Outline

The tutorial is scheduled for a quarter day (1 hour and 45 minutes) and is organized into the following five sections.

  • Introduction to multi-modal generative AI and out-of-distribution generation
  • Basics and fundamentals of multi-modal LLMs and diffusion models
  • Generalizable Post-training to New Concepts
  • Unified Multi-Modal Understanding and Generation
  • Discussions and future directions

Target Audience and Prerequisites

    This tutorial is accessible to the whole machine learning community, including researchers, scholars, engineers, and students with backgrounds in computer vision (CV), natural language processing (NLP), large language models (LLMs), artificial intelligence generated content (AIGC), and related areas. It is self-contained and designed for introductory and intermediate audiences. No special prerequisite knowledge is required to attend.


    Tutorial Objective

    Multi-modal generative AI has rapidly become a central paradigm in AI research, driven by advances in multi-modal large language models and diffusion-based generation, with broad impact on vision–language reasoning, content creation, and interactive systems. This tutorial will interest a substantial portion of the IJCAI audience by providing a systematic survey of foundational methodologies and emerging challenges in out-of-distribution generalization, while synthesizing post-training adaptation strategies and unified understanding–generation frameworks. It best serves the objectives of introducing major and emerging topics to novices and expert non-specialists, surveying a fast-growing area of AI research, and presenting a novel synthesis that bridges distinct lines of work in multi-modal understanding and generation.


    Tutorial Overview

    Introduction

    Multi-modal generative AI has attracted increasing attention in both academia and industry, with two dominant model families: multi-modal large language models (MLLMs) for multi-modal understanding and diffusion-based models for multi-modal generation. Despite the strong performance enabled by large-scale pretraining, real-world deployment occurs in open and evolving environments, where the distributional assumptions made during training are often violated.

    We formalize the out-of-distribution (OOD) environment in post-training as follows. Let \( f_{\theta}(y \mid x) \) denote a conditional generative model, and let \( P(x,y) \) be the joint data distribution used for post-training. The model parameters \( \theta_P^{*} \) are obtained by minimizing the expected training loss:

    $$ \theta_P^{*} = \arg\min_{\theta} \mathbb{E}_{(x,y)\sim P} \left[ \mathcal{L}(f_{\theta}(y\mid x), y) \right]. $$

    At inference time, data are drawn from a different distribution \( Q(x,y) \). Thus we define the OOD environment by the mismatch

    $$ \mathcal{R}_Q(\theta_P^{*}) \neq \min_{\theta}\mathcal{R}_Q(\theta), \quad \text{where } \mathcal{R}_Q(\theta)= \mathbb{E}_{(x,y)\sim Q} \left[ \mathcal{L}(f_{\theta}(y\mid x), y) \right], $$

    which can arise whenever \( P(x,y)\neq Q(x,y) \). In practice, this discrepancy may stem from shifts in the input distribution \( P(x)\neq Q(x) \), changes in the conditional generative relationship \( P(y\mid x)\neq Q(y\mid x) \), or the emergence of novel concepts outside the support of the post-training data. Such mismatches cause post-trained models to over-specialize to \( P \), leading to degraded generalization at inference.
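    The mismatch above can be illustrated numerically. The following is a minimal sketch under strong simplifying assumptions that are not part of the tutorial: the conditional generative model is replaced by a scalar predictor \( \hat{y} = \theta x \), the loss is squared error, and \( P \) and \( Q \) differ only in the conditional relationship (slope 1 versus slope 2). All function names are hypothetical.

```python
import random

def fit_slope(data):
    """Least-squares slope through the origin: argmin_theta mean((theta*x - y)^2)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / sxx

def risk(theta, data):
    """Empirical squared-error risk of the predictor y_hat = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def sample(slope, n, rng):
    """Draw n pairs with x ~ U(-1, 1) and y = slope * x + small Gaussian noise."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        pairs.append((x, slope * x + rng.gauss(0.0, 0.1)))
    return pairs

rng = random.Random(0)
data_P = sample(slope=1.0, n=2000, rng=rng)  # stand-in for the post-training distribution P
data_Q = sample(slope=2.0, n=2000, rng=rng)  # stand-in for Q, with Q(y|x) != P(y|x)

theta_P = fit_slope(data_P)                  # empirical analogue of theta_P^*
theta_Q = fit_slope(data_Q)                  # empirical minimizer of R_Q

risk_Q_at_theta_P = risk(theta_P, data_Q)    # R_Q(theta_P^*)
risk_Q_min = risk(theta_Q, data_Q)           # min_theta R_Q(theta)
ood_gap = risk_Q_at_theta_P - risk_Q_min     # strictly positive gap signals the mismatch
print(f"R_Q(theta_P*) = {risk_Q_at_theta_P:.4f}, min R_Q = {risk_Q_min:.4f}, gap = {ood_gap:.4f}")
```

    In this toy setting the parameters that are optimal under \( P \) incur a clearly larger risk under \( Q \) than the best achievable one, which is the inequality used above to define the OOD environment.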

    Accordingly, this tutorial focuses on the following topics.

    Multi-Modal LLM

    Multi-modal large language models have recently become dominant in the field of multi-modal understanding. In this section, we review the literature on multi-modal large language models.

    Before discussing detailed MLLM works, we first present preliminaries involving LLM auto-regressive modeling, vision-language pretraining, and visual tokenizers. We then categorize existing MLLM architectures into two branches: early-fusion architectures and alignment architectures, and analyze their potential advantages and disadvantages. Furthermore, we shed light on existing image LLMs and video LLMs and discuss possible future challenges.
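    As a reference point for the auto-regressive preliminaries, the standard next-token factorization can be written as follows. The notation is assumed here for illustration and is not taken from the tutorial: \( v \) denotes the visual tokens produced by a visual tokenizer or encoder, and \( w_{1:T} \) denotes the text tokens.

    $$ p_{\theta}(w_{1:T}\mid v) = \prod_{t=1}^{T} p_{\theta}(w_t \mid w_{<t}, v), \qquad \mathcal{L}_{\mathrm{AR}}(\theta) = -\sum_{t=1}^{T}\log p_{\theta}(w_t\mid w_{<t}, v). $$

    Under this view, the architectural families discussed above differ mainly in how \( v \) enters the conditioning context of the language model.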

    Diffusion Models

    Diffusion models have become dominant in the field of visual generation. We begin with preliminaries such as DDPM and compare diffusion models with traditional generative models such as VAE and GAN.
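    The DDPM forward (noising) process admits a closed form, \( q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I) \) with \( \bar{\alpha}_t = \prod_{s\le t}(1-\beta_s) \). A minimal sketch of it is given below, assuming a linear \( \beta \) schedule from \( 10^{-4} \) to \( 0.02 \) over 1000 steps and a toy 4-dimensional "image"; the function names are hypothetical and no denoising network is included.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I); also return the noise."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(x0, noise)]
    return x_t, noise

betas = linear_betas(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]                                 # toy clean sample
x_early, _ = q_sample(x0, 0, alpha_bars, rng)              # first step: almost unchanged
x_late, noise_late = q_sample(x0, 999, alpha_bars, rng)    # last step: almost pure noise
print(alpha_bars[0], alpha_bars[-1])
```

    In the simplified DDPM training objective, a network receives \( x_t \) and \( t \) and is trained to regress the returned noise; the reverse process then removes noise step by step.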

    We then focus on two important techniques: Latent Diffusion Models and Diffusion Transformers. In text-to-image and text-to-video applications, we discuss several influential works such as DALL-E and Stable Diffusion 3. We elaborate on how they model multi-modal interactions and how these models can be made more controllable with different conditioning modalities.

    Generalizable Post-training to New Concepts

    A key challenge in multi-modal generative AI is adaptation to new concepts, where users introduce new demands, application scenarios evolve, and data distributions shift. These are commonly known as OOD or non-IID issues. Pretrained multi-modal foundation models are inherently static, limiting their ability to handle new concepts, emerging subjects, and dynamically changing scenes. In real-world settings, multiple new subjects interact through diverse actions, while their environments continuously evolve. These dynamic challenges raise a critical research question: how can we model and control multiple new subjects, actions, and dynamic scenes effectively? In this section, we discuss generalizable solutions related to disentangled finetuning and curriculum multi-reward reinforcement finetuning.
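    To give a feel for what combining a curriculum with multiple rewards could look like, here is a purely illustrative sketch. It is not the method presented in the tutorial: the reward names, the linear interpolation schedule, and both functions are assumptions made for this example. The idea shown is only that the scalar reward used for finetuning can shift its emphasis from an easier criterion to a balanced mix as training progresses.

```python
def curriculum_weights(progress, early, late):
    """Interpolate reward weights from `early` (progress=0) to `late` (progress=1), then normalize."""
    p = min(max(progress, 0.0), 1.0)
    raw = {name: (1.0 - p) * early[name] + p * late[name] for name in early}
    total = sum(raw.values())
    if total <= 0.0:
        raise ValueError("reward weights must have a positive sum")
    return {name: value / total for name, value in raw.items()}

def combined_reward(rewards, weights):
    """Weighted sum of per-criterion rewards for one generated sample."""
    return sum(weights[name] * rewards[name] for name in weights)

# Hypothetical reward criteria; a real system would compute these with reward models.
early = {"subject_fidelity": 1.0, "text_alignment": 0.0, "motion_quality": 0.0}
late = {"subject_fidelity": 1 / 3, "text_alignment": 1 / 3, "motion_quality": 1 / 3}
sample_rewards = {"subject_fidelity": 0.9, "text_alignment": 0.3, "motion_quality": 0.6}

for progress in (0.0, 0.5, 1.0):
    weights = curriculum_weights(progress, early, late)
    print(progress, round(combined_reward(sample_rewards, weights), 4))
```

    The scalar returned by `combined_reward` would play the role of the reward signal in a reinforcement finetuning loop, which is omitted here.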

    Unified Multi-Modal Understanding and Generation

    After discussing the MLLM and diffusion literature, we focus on unified multi-modal understanding and generation architectures. The central challenge is how to simultaneously support multi-modal understanding and visual generation. We discuss this problem from three perspectives: (i) the probabilistic modeling procedure, (ii) tokenization methods, and (iii) model architecture. From the probabilistic modeling perspective, we present both pure auto-regressive frameworks and mixed auto-regressive and diffusion frameworks. From the tokenization perspective, we discuss discrete tokens, continuous tokens, pixel tokens, and semantic tokens. From the model architecture perspective, we discuss how different modalities should be handled and whether large models should adopt dense architectures or Mixture-of-Experts designs.
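    The distinction between discrete and continuous tokens can be made concrete with a small example. The sketch below assumes a vector-quantization style tokenizer, which is one common way to obtain discrete visual tokens; the 2-dimensional features, the 4-entry codebook, and the function names are made up for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

def dequantize(token_ids, codebook):
    """Look up the codebook vector for each discrete token id."""
    return [codebook[i] for i in token_ids]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]              # toy learned codebook
continuous_tokens = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.7, 0.9]]    # toy encoder outputs

discrete_tokens = quantize(continuous_tokens, codebook)   # [0, 1, 2, 3]
reconstructed = dequantize(discrete_tokens, codebook)     # lossy: features snap to codebook entries
print(discrete_tokens)
```

    Discrete ids of this kind fit naturally into a next-token prediction objective over a shared vocabulary, whereas continuous tokens keep the original feature vectors and avoid the quantization error visible in `reconstructed`, at the cost of needing a different output head or objective.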

    Future Directions