A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs [PDF]
Wangbo Zhao*, Yizeng Han*, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You
arXiv preprint, 2024.
We propose using a small VLM to guide visual token pruning in a large VLM. The small VLM can also perform dynamic early exiting to further improve inference efficiency.
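For intuition, a minimal PyTorch sketch of attention-guided visual token pruning with early exiting is given below. The scoring rule, keep ratio, and the small_vlm/large_vlm interfaces are placeholders for this sketch, not the paper's exact procedure.

```python
import torch

def prune_visual_tokens(visual_tokens, small_vlm_attn, keep_ratio=0.3):
    """Keep the visual tokens that a small VLM attends to most.

    visual_tokens:  (B, N, D) visual features to be fed to the large VLM.
    small_vlm_attn: (B, N) aggregated text-to-visual attention from the small
                    VLM (the aggregation itself is assumed here).
    """
    B, N, D = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = small_vlm_attn.topk(k, dim=1).indices.sort(dim=1).values   # keep original order
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

def answer(image, question, small_vlm, large_vlm, conf_thresh=0.9):
    """Hypothetical pipeline: exit early if the small VLM is confident enough,
    otherwise run the large VLM on the pruned visual tokens."""
    logits, attn, visual_tokens = small_vlm(image, question)    # assumed interface
    if logits.softmax(-1).max(-1).values.mean() > conf_thresh:  # dynamic early exit
        return logits
    return large_vlm(prune_visual_tokens(visual_tokens, attn), question)
```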
Dynamic Diffusion Transformer [PDF]
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You
arXiv preprint, 2024.
We propose to dynamically adjust the computation of DiT across different timesteps and spatial locations of the image. This reduces the computation of DiT-XL by 50% without sacrificing generation quality.
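A toy sketch of the idea follows: a router conditioned on the timestep embedding decides, per spatial token, whether a block's MLP runs or is skipped. The router design and granularity are assumptions made for illustration, not DyDiT's actual architecture.

```python
import torch
import torch.nn as nn

class TokenDynamicMLP(nn.Module):
    """Toy DiT-style MLP sub-block whose computation depends on the timestep
    and on per-token decisions (an illustration, not the paper's design)."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.router = nn.Linear(2 * dim, 1)   # scores each token given the timestep

    def forward(self, x, t_emb, threshold=0.5):
        # x: (B, N, D) image tokens; t_emb: (B, D) timestep embedding
        cond = torch.cat([x, t_emb.unsqueeze(1).expand_as(x)], dim=-1)
        keep = torch.sigmoid(self.router(cond)) > threshold      # (B, N, 1) keep decision
        # For clarity the MLP runs on every token and the mask selects the output;
        # an efficient implementation would gather only the kept tokens.
        return torch.where(keep, x + self.mlp(x), x)
```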
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [PDF] [Code]
Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You
Conference on Neural Information Processing Systems (NeurIPS), 2024.
We propose to adapt a static ViT into a dynamic ViT via parameter-efficient fine-tuning, avoiding full-parameter tuning.
Demystify Mamba in Vision: A Linear Attention Perspective [PDF] [Code]
Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
Conference on Neural Information Processing Systems (NeurIPS), 2024.
By exploring the similarities and disparities between the effective Mamba and the subpar linear attention Transformer, we provide a comprehensive analysis that demystifies the key factors behind Mamba's success. Based on these findings, we propose a Mamba-Like Linear Attention (MLLA) model.
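For reference, the generic kernelized linear attention that such comparisons start from fits in a few lines (this is the standard formulation with an elu+1 feature map; MLLA's specific design is not reproduced here).

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic kernelized linear attention: dropping the softmax lets a (d x d)
    key-value summary be computed once, so the cost is O(N d^2) rather than the
    O(N^2 d) of softmax attention. Inputs q, k, v are (B, N, d)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature maps
    kv = k.transpose(-2, -1) @ v                             # (B, d, d) key-value summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)    # (B, N, 1) normalizer
    return (q @ kv) / (z + eps)
```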
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [PDF] [Code]
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
Conference on Neural Information Processing Systems (NeurIPS), 2024.
We propose dynamic early exiting for MLLMs to enable efficient robot execution.
ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis [PDF] [Code]
Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang
Conference on Neural Information Processing Systems (NeurIPS), 2024.
We propose EfficientNAT (ENAT), a NAT model that explicitly encourages the critical spatial-temporal interactions inherent in non-autoregressive Transformers.
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [PDF]
Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li
European Conference on Computer Vision (ECCV), 2024.
We present a diffusion transformer framework that incorporates an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens across denoising steps, the model starts the denoising process with a precise, non-ambiguous stage and gradually transitions to a detail-rich phase.
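For intuition, attention routed through mediator tokens can be sketched as below. How the mediators are formed (simple pooling of the queries here) is an assumption of this sketch, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def mediator_attention(q, k, v, num_mediators):
    """Two-stage attention through a small set of mediator tokens: mediators
    first attend to keys/values, then queries attend to the mediators.
    Inputs are (B, N, d); cost is O(N * num_mediators * d) instead of O(N^2 * d)."""
    B, N, d = q.shape
    m = F.adaptive_avg_pool1d(q.transpose(1, 2), num_mediators).transpose(1, 2)  # (B, m, d)
    agg = F.softmax(m @ k.transpose(1, 2) / d ** 0.5, dim=-1) @ v                # mediators <- k, v
    return F.softmax(q @ m.transpose(1, 2) / d ** 0.5, dim=-1) @ agg             # queries <- mediators

# num_mediators can then be varied per denoising timestep (step-wise modulation).
```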
Agent Attention: On the Integration of Softmax and Linear Attention [PDF] [Code]
Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang
European Conference on Computer Vision (ECCV), 2024.
We propose Agent Attention, a linear attention mechanism for visual recognition and generation. Notably, agent attention shows remarkable performance in high-resolution scenarios, owing to its linear complexity.
Latency-aware Unified Dynamic Networks for Efficient Image Recognition [PDF] [Code] [将门创投]
Yizeng Han*, Zeyu Liu*, Zhihang Yuan*, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=24.314), 2024.
We propose Latency-aware Unified Dynamic Networks (LAUDNet), a comprehensive framework that amalgamates three cornerstone dynamic paradigms—spatially-adaptive computation, dynamic layer skipping, and dynamic channel skipping—under a unified formulation.
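As a toy illustration of a unified formulation, all three paradigms can be expressed as binary gates applied at different granularities; the gate designs below are placeholders, not LAUDNet's actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedDynamicBlock(nn.Module):
    """Toy residual block with three binary gates: a layer gate (skip the whole
    block), channel gates, and spatial gates (for illustration only)."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.layer_gate = nn.Linear(channels, 1)
        self.channel_gate = nn.Linear(channels, channels)
        self.spatial_gate = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        ctx = x.mean(dim=(2, 3))                                # (B, C) global context
        if torch.sigmoid(self.layer_gate(ctx)).mean() < 0.5:    # dynamic layer skipping
            return x                                            # (batch-level here for simplicity)
        c_mask = (torch.sigmoid(self.channel_gate(ctx)) > 0.5).float()[:, :, None, None]
        s_mask = (torch.sigmoid(self.spatial_gate(x)) > 0.5).float()   # (B, 1, H, W)
        # Masks emulate skipping; a latency-aware implementation would gather
        # only the kept channels and spatial locations.
        return x + self.conv(x) * c_mask * s_mask
```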
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection [PDF] [Code]
Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li
European Conference on Computer Vision (ECCV), 2024.
For temporal action detection (TAD), we build a novel dynamic feature aggregation (DFA) module that simultaneously adapts kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates temporal features within action time ranges and guarantees the discriminability of the extracted representations.
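A toy sketch of per-timestamp dynamic aggregation is shown below: kernel weights are generated from the features themselves. The weight generator and the omission of receptive-field adaptation are simplifications for this sketch, not DyFADet's actual DFA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTemporalConv(nn.Module):
    """Toy dynamic 1D convolution whose aggregation weights differ per timestamp."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.weight_gen = nn.Conv1d(channels, kernel_size, 1)   # per-step kernel weights

    def forward(self, x):
        # x: (B, C, T) temporal features
        B, C, T = x.shape
        w = torch.softmax(self.weight_gen(x), dim=1)            # (B, K, T) weights per step
        patches = F.unfold(x.unsqueeze(-1), (self.k, 1), padding=(self.k // 2, 0))
        patches = patches.view(B, C, self.k, T)                 # (B, C, K, T) local neighborhoods
        return (patches * w.unsqueeze(1)).sum(dim=2)            # (B, C, T) aggregated features
```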
SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [PDF] [Code]
Chaoqun Du*, Yizeng Han*, Gao Huang
International Conference on Machine Learning (ICML), 2024.
We focus on a realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data is unknown and mismatched. The proposed SimPro does not rely on any predefined assumptions about the distribution of unlabeled data.
GSVA: Generalized Segmentation via Multimodal Large Language Models [PDF] [Code]
Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
We propose the Generalized Segmentation Vision Assistant (GSVA) to address the multi-target and empty-target cases in Generalized Referring Expression Segmentation (GRES).