Yizeng Han (韩益增)

🧑‍🎓 Bio

I'm a research scientist at Alibaba DAMO Academy, Beijing, China. I received my Ph.D. degree from the Department of Automation, Tsinghua University, advised by Prof. Gao Huang and Prof. Shiji Song. Download my C.V. here: English / 简体中文.
🌟 My research focuses on deep learning and computer vision, in particular dynamic neural networks and efficient learning/inference of deep models in resource-constrained scenarios.
🔥 Recently, I have become interested in efficient/dynamic vision-language models (VLMs) and visual generation.
🧐 I'm also interested in fundamental machine learning problems, such as semi-supervised long-tailed learning and fine-grained learning.

📚 Education

  • Ph.D., Tsinghua University, 2018 - 2024.
  • B.E., Tsinghua University, 2014 - 2018.

💡 Research Experience

  • Research Intern, Megvii Technology (Foundation Model Group, advisor: Xiangyu Zhang), 04/2023 - 12/2023
  • Research Intern, Georgia Institute of Technology (advisor: Gregory D. Abowd), 06/2017 - 08/2017

🔥 News

  • 11/2024: 🎉 First Prize of CSIG Natural Science Award (中国图象图形学会自然科学奖一等奖).
  • 09/2024: 🎉 Five works are accepted at NeurIPS 2024.
  • 07/2024: 🎉 Four works are accepted at ECCV 2024.
  • 06/2024: 🎉 Received the Outstanding Graduate of Tsinghua University (清华大学优秀毕业生), Outstanding Doctoral Dissertation of Tsinghua University (清华大学优秀博士学位论文), and Outstanding Graduate of Beijing (北京市优秀毕业生) awards.
  • 05/2024: 🎉 Our work (EfficientTrain++) is accepted at TPAMI!
  • 05/2024: 🎉 Our work (SimPro) is accepted at ICML 2024.
  • 04/2024: 🎉 Our work (LAUDNet) is accepted at TPAMI!
  • 02/2024: 🎉 Two works (GSVA and Mask Grounding) are accepted at CVPR 2024.
  • 12/2023: 🎉 Our work (Learnable Semantic Data Augmentation) is accepted at TIP.
  • 10/2023: 🎉 Awarded the Comprehensive Excellence Scholarship (综合奖学金), Tsinghua University, 2023.
  • 07/2023: 🎉 Three works are accepted at ICCV 2023.
  • 10/2022: 🎉 Awarded the National Scholarship (国家奖学金), Ministry of Education of China.

📄 Selected Papers (Full publication list on Google Scholar)

Recent Works Representative Publications

A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs [PDF]

Wangbo Zhao*, Yizeng Han*, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You

arXiv preprint, 2024.

We propose to use a small VLM to guide the visual token pruning in a large VLM. Meanwhile, the small VLM can also perform dynamic early exiting to further improve the inference efficiency.

Dynamic Diffusion Transformer [PDF]

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

arXiv preprint, 2024.

We propose to dynamically adjust the computation of DiT across different timesteps and spatial locations of images. The computation of DiT-XL can be reduced by 50% without sacrificing generation quality.

Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [PDF] [Code]

Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You

Conference on Neural Information Processing Systems (NeurIPS), 2024.

We propose to adapt static ViT to dynamic ViT via parameter-efficient fine-tuning without full-parameter tuning.

Demystify Mamba in Vision: A Linear Attention Perspective [PDF] [Code]

Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang

Conference on Neural Information Processing Systems (NeurIPS), 2024.

By exploring the similarities and disparities between the effective Mamba and subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba’s success. Based on these findings, we propose a Mamba-Like Linear Attention (MLLA) model.

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [PDF] [Code]

Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang

Conference on Neural Information Processing Systems (NeurIPS), 2024.

We propose dynamic early exiting in Robot MLLMs.

ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis [PDF] [Code]

Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang

Conference on Neural Information Processing Systems (NeurIPS), 2024.

We propose EfficientNAT (ENAT), a NAT model that explicitly encourages critical interactions inherent in non-autoregressive Transformers.

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [PDF]

Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li

European Conference on Computer Vision (ECCV), 2024

We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens during the denoising generation phases, our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.

Agent Attention: On the Integration of Softmax and Linear Attention [PDF] [Code]

Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang

European Conference on Computer Vision (ECCV), 2024

We propose Agent Attention, a linear attention mechanism for visual recognition and generation. Notably, Agent Attention shows remarkable performance in high-resolution scenarios, owing to its linear attention nature.

Latency-aware Unified Dynamic Networks for Efficient Image Recognition [PDF] [Code] [将门创投]

Yizeng Han*, Zeyu Liu*, Zhihang Yuan*, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=24.314), 2024

We propose Latency-aware Unified Dynamic Networks (LAUDNet), a comprehensive framework that amalgamates three cornerstone dynamic paradigms—spatially-adaptive computation, dynamic layer skipping, and dynamic channel skipping—under a unified formulation.

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection [PDF] [Code]

Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li

European Conference on Computer Vision (ECCV), 2024

In the temporal action detection (TAD) task, we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations.

SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [PDF] [Code]

Chaoqun Du*, Yizeng Han*, Gao Huang

International Conference on Machine Learning (ICML), 2024

We focus on a realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data is unknown and mismatched. The proposed SimPro does not rely on any predefined assumptions about the distribution of unlabeled data.

GSVA: Generalized Segmentation via Multimodal Large Language Models [PDF] [Code]

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

We propose Generalized Segmentation Vision Assistant (GSVA) to address the issues of multi-object and empty-object in Generalized Referring Expression Segmentation (GRES).

Dynamic Neural Networks: A Survey [PDF] [智源社区] [机器之心-在线讲座] [Bilibili] [Slides]

Yizeng Han*, Gao Huang*, Shiji Song, Le Yang, Honghui Wang, Yulin Wang

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=24.314), 2021

In this survey, we comprehensively review the rapidly developing area of dynamic neural networks. The important research problems, e.g., architecture design, decision-making schemes, and optimization techniques, are reviewed systematically. We also discuss the open problems in this field together with interesting future research directions.

Latency-aware Unified Dynamic Networks for Efficient Image Recognition [PDF] [Code] [将门创投]

Yizeng Han*, Zeyu Liu*, Zhihang Yuan*, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=24.314), 2024

We propose Latency-aware Unified Dynamic Networks (LAUDNet), a comprehensive framework that amalgamates three cornerstone dynamic paradigms—spatially-adaptive computation, dynamic layer skipping, and dynamic channel skipping—under a unified formulation.

Dynamic Perceiver for Efficient Visual Recognition [PDF] [Code] [Youtube]

Yizeng Han*, Dongchen Han*, Zeyu Liu, Yulin Wang, Xuran Pan, Yifan Pu, Chao Deng, Junlan Feng, Shiji Song, Gao Huang

IEEE/CVF International Conference on Computer Vision (ICCV), 2023

We propose Dynamic Perceiver (Dyn-Perceiver), a general framework which can be conveniently implemented on top of any visual backbones. It explicitly decouples feature extraction and early classification. We show that early classifiers can be constructed in the classification branch without harming the performance of the last classifier. Experiments demonstrate that Dyn-Perceiver significantly outperforms existing state-of-the-art methods in terms of the trade-off between accuracy and efficiency.

Learning to Weight Samples for Dynamic Early-exiting Networks [PDF] [Code] [Youtube]

Yizeng Han*, Yifan Pu*, Zihang Lai, Chaofei Wang, Shiji Song, Junfen Cao, Wenhui Huang, Chao Deng, Gao Huang

European Conference on Computer Vision (ECCV), 2022

We propose to bridge the gap between training and testing of dynamic early-exiting networks by sample weighting. By bringing the adaptive behavior during inference into the training phase, we show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.

Latency-aware Spatial-wise Dynamic Networks [PDF] [Code]

Yizeng Han*, Zhihang Yuan*, Yifan Pu*, Chenhao Xue, Shiji Song, Guangyu Sun, Gao Huang

Conference on Neural Information Processing Systems (NeurIPS), 2022

We use a latency predictor to guide both algorithm design and scheduling optimization of spatial-wise dynamic networks on various hardware platforms. We show that "coarse-grained" spatially adaptive computation can effectively reduce the memory access cost and is more efficient than pixel-level dynamic operations.

Resolution Adaptive Networks for Efficient Inference [PDF] [Code]

Le Yang*, Yizeng Han*, Xi Chen*, Shiji Song, Jifeng Dai, Gao Huang

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

We focus on the spatial redundancy of images, and propose a novel Resolution Adaptive Network (RANet), which is inspired by the intuition that low-resolution representations are sufficient for classifying “easy” inputs, while only some “hard” samples need spatially detailed information. Empirically, we demonstrate the effectiveness of the proposed RANet in both the anytime prediction setting and the budgeted batch classification setting.

Spatially Adaptive Feature Refinement for Efficient Inference [PDF]

Yizeng Han, Gao Huang, Shiji Song, Le Yang, Yitian Zhang, Haojun Jiang

IEEE Transactions on Image Processing (TIP, IF=11.041), 2021

We propose to perform efficient inference by adaptively fusing information from two branches: one conducts standard convolution on inputs at a lower resolution, and the other selectively refines a set of regions at the original resolution. Experiments on classification, object detection and semantic segmentation validate that the proposed spatially adaptive feature refinement (SAR) consistently improves network performance and efficiency.

SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [PDF] [Code]

Chaoqun Du*, Yizeng Han*, Gao Huang

International Conference on Machine Learning (ICML), 2024

We focus on a realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data is unknown and mismatched. The proposed SimPro does not rely on any predefined assumptions about the distribution of unlabeled data.

Fine-grained Recognition with Learnable Semantic Data Augmentation [PDF] [Code]

Yifan Pu*, Yizeng Han*, Yulin Wang, Junlan Feng, Chao Deng, Gao Huang

IEEE Transactions on Image Processing (TIP), 2023

We propose diversifying the training data in the feature space to alleviate the discriminative region loss problem in fine-grained image recognition. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a sample-wise covariance prediction network.

🎖 Awards

  • First Prize of CSIG Natural Science Award (中国图象图形学会自然科学奖一等奖), fourth contributor, Tsinghua University & Alibaba DAMO Academy, 2024
  • Outstanding Doctoral Dissertation of Tsinghua University (清华大学优秀博士学位论文), Tsinghua University, 2024
  • Outstanding Graduate of Tsinghua University (清华大学优秀毕业生), Tsinghua University, 2024
  • Outstanding Graduate of Beijing (北京市优秀毕业生), Beijing Municipal Education Commission, 2024
  • Comprehensive Excellence Scholarship (综合奖学金), Tsinghua University, 2023
  • National Scholarship (国家奖学金), Ministry of Education of China, 2022
  • Comprehensive Excellence Scholarship (综合奖学金), Tsinghua University, 2017
  • Comprehensive Excellence Scholarship (综合奖学金), Tsinghua University, 2016
  • Academic Excellence Scholarship (学业优秀奖学金), Tsinghua University, 2015

📧 Contact

  • hanyizeng.hyz at alibaba-inc dot com
  • yizeng38 at gmail dot com