A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs [PDF]
Wangbo Zhao*, Yizeng Han*, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, Yang You
arXiv preprint, 2024.
We propose using a small VLM to guide visual token pruning in a large VLM. The small VLM can also perform dynamic early exiting to further improve inference efficiency.
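For intuition, a minimal PyTorch sketch of attention-guided visual token pruning with early exiting is given below. The scoring rule, keep ratio, and the small_vlm/large_vlm interfaces are placeholders for this sketch, not the paper's exact procedure.

```python
import torch

def prune_visual_tokens(visual_tokens, small_vlm_attn, keep_ratio=0.3):
    """Keep the visual tokens that a small VLM attends to most.

    visual_tokens:  (B, N, D) visual features to be fed to the large VLM.
    small_vlm_attn: (B, N) aggregated text-to-visual attention from the small
                    VLM (the aggregation itself is assumed here).
    """
    B, N, D = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = small_vlm_attn.topk(k, dim=1).indices.sort(dim=1).values   # keep original order
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

def answer(image, question, small_vlm, large_vlm, conf_thresh=0.9):
    """Hypothetical pipeline: exit early if the small VLM is confident enough,
    otherwise run the large VLM on the pruned visual tokens."""
    logits, attn, visual_tokens = small_vlm(image, question)    # assumed interface
    if logits.softmax(-1).max(-1).values.mean() > conf_thresh:  # dynamic early exit
        return logits
    return large_vlm(prune_visual_tokens(visual_tokens, attn), question)
```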
Dynamic Diffusion Transformer [PDF]
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You
arXiv preprint, 2024.
We propose to dynamically adjust the computation of DiT across different timesteps and spatial locations of the image. This reduces the computation of DiT-XL by 50% without sacrificing generation quality.
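A toy sketch of the idea follows: a router conditioned on the timestep embedding decides, per spatial token, whether a block's MLP runs or is skipped. The router design and granularity are assumptions made for illustration, not DyDiT's actual architecture.

```python
import torch
import torch.nn as nn

class TokenDynamicMLP(nn.Module):
    """Toy DiT-style MLP sub-block whose computation depends on the timestep
    and on per-token decisions (an illustration, not the paper's design)."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.router = nn.Linear(2 * dim, 1)   # scores each token given the timestep

    def forward(self, x, t_emb, threshold=0.5):
        # x: (B, N, D) image tokens; t_emb: (B, D) timestep embedding
        cond = torch.cat([x, t_emb.unsqueeze(1).expand_as(x)], dim=-1)
        keep = torch.sigmoid(self.router(cond)) > threshold      # (B, N, 1) keep decision
        # For clarity the MLP runs on every token and the mask selects the output;
        # an efficient implementation would gather only the kept tokens.
        return torch.where(keep, x + self.mlp(x), x)
```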
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [PDF] [Code]
Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You
Conference on Neural Information Processing Systems (NeurIPS), 2024.
We propose to adapt a static ViT into a dynamic ViT via parameter-efficient fine-tuning, avoiding full-parameter tuning.
Demystify Mamba in Vision: A Linear Attention Perspective [PDF] [Code]
Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
Conference on Neural Information Processing Systems (NeurIPS), 2024.
By exploring the similarities and disparities between the effective Mamba and the subpar linear attention Transformer, we provide a comprehensive analysis that demystifies the key factors behind Mamba's success. Based on these findings, we propose a Mamba-Like Linear Attention (MLLA) model.
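For reference, the generic kernelized linear attention that such comparisons start from fits in a few lines (this is the standard formulation with an elu+1 feature map; MLLA's specific design is not reproduced here).

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic kernelized linear attention: dropping the softmax lets a (d x d)
    key-value summary be computed once, so the cost is O(N d^2) rather than the
    O(N^2 d) of softmax attention. Inputs q, k, v are (B, N, d)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature maps
    kv = k.transpose(-2, -1) @ v                             # (B, d, d) key-value summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)    # (B, N, 1) normalizer
    return (q @ kv) / (z + eps)
```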
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [PDF] [Code]
Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
Conference on Neural Information Processing Systems (NeurIPS), 2024.
We propose dynamic early exiting for MLLMs to enable efficient robot execution.
ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis [PDF] [Code]
Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang
Conference on Neural Information Processing Systems (NeurIPS), 2024.
We propose EfficientNAT (ENAT), a NAT model that explicitly encourages the critical spatial-temporal interactions inherent in non-autoregressive Transformers.
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [PDF]
Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li
European Conference on Computer Vision (ECCV), 2024.
We present a diffusion transformer framework that incorporates an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens across denoising steps, the model starts the denoising process with a precise, non-ambiguous stage and gradually transitions to a detail-rich phase.
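For intuition, attention routed through mediator tokens can be sketched as below. How the mediators are formed (simple pooling of the queries here) is an assumption of this sketch, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def mediator_attention(q, k, v, num_mediators):
    """Two-stage attention through a small set of mediator tokens: mediators
    first attend to keys/values, then queries attend to the mediators.
    Inputs are (B, N, d); cost is O(N * num_mediators * d) instead of O(N^2 * d)."""
    B, N, d = q.shape
    m = F.adaptive_avg_pool1d(q.transpose(1, 2), num_mediators).transpose(1, 2)  # (B, m, d)
    agg = F.softmax(m @ k.transpose(1, 2) / d ** 0.5, dim=-1) @ v                # mediators <- k, v
    return F.softmax(q @ m.transpose(1, 2) / d ** 0.5, dim=-1) @ agg             # queries <- mediators

# num_mediators can then be varied per denoising timestep (step-wise modulation).
```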
Agent Attention: On the Integration of Softmax and Linear Attention [PDF] [Code]
Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang
European Conference on Computer Vision (ECCV), 2024.
We propose Agent Attention, a linear attention mechanism for visual recognition and generation. Notably, agent attention shows remarkable performance in high-resolution scenarios, owing to its linear complexity.
Latency-aware Unified Dynamic Networks for Efficient Image Recognition [PDF] [Code] [将门创投]
Yizeng Han*, Zeyu Liu*, Zhihang Yuan*, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=24.314), 2024.
We propose Latency-aware Unified Dynamic Networks (LAUDNet), a comprehensive framework that amalgamates three cornerstone dynamic paradigms—spatially-adaptive computation, dynamic layer skipping, and dynamic channel skipping—under a unified formulation.
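As a toy illustration of a unified formulation, all three paradigms can be expressed as binary gates applied at different granularities; the gate designs below are placeholders, not LAUDNet's actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedDynamicBlock(nn.Module):
    """Toy residual block with three binary gates: a layer gate (skip the whole
    block), channel gates, and spatial gates (for illustration only)."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.layer_gate = nn.Linear(channels, 1)
        self.channel_gate = nn.Linear(channels, channels)
        self.spatial_gate = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        ctx = x.mean(dim=(2, 3))                                # (B, C) global context
        if torch.sigmoid(self.layer_gate(ctx)).mean() < 0.5:    # dynamic layer skipping
            return x                                            # (batch-level here for simplicity)
        c_mask = (torch.sigmoid(self.channel_gate(ctx)) > 0.5).float()[:, :, None, None]
        s_mask = (torch.sigmoid(self.spatial_gate(x)) > 0.5).float()   # (B, 1, H, W)
        # Masks emulate skipping; a latency-aware implementation would gather
        # only the kept channels and spatial locations.
        return x + self.conv(x) * c_mask * s_mask
```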
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection [PDF] [Code]
Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li
European Conference on Computer Vision (ECCV), 2024.
For temporal action detection (TAD), we build a novel dynamic feature aggregation (DFA) module that simultaneously adapts kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates temporal features within action time ranges and guarantees the discriminability of the extracted representations.
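A toy sketch of per-timestamp dynamic aggregation is shown below: kernel weights are generated from the features themselves. The weight generator and the omission of receptive-field adaptation are simplifications for this sketch, not DyFADet's actual DFA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicTemporalConv(nn.Module):
    """Toy dynamic 1D convolution whose aggregation weights differ per timestamp."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.weight_gen = nn.Conv1d(channels, kernel_size, 1)   # per-step kernel weights

    def forward(self, x):
        # x: (B, C, T) temporal features
        B, C, T = x.shape
        w = torch.softmax(self.weight_gen(x), dim=1)            # (B, K, T) weights per step
        patches = F.unfold(x.unsqueeze(-1), (self.k, 1), padding=(self.k // 2, 0))
        patches = patches.view(B, C, self.k, T)                 # (B, C, K, T) local neighborhoods
        return (patches * w.unsqueeze(1)).sum(dim=2)            # (B, C, T) aggregated features
```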
SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [PDF] [Code]
Chaoqun Du*, Yizeng Han*, Gao Huang
International Conference on Machine Learning (ICML), 2024.
We focus on a realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data is unknown and mismatched. The proposed SimPro does not rely on any predefined assumptions about the distribution of unlabeled data.
GSVA: Generalized Segmentation via Multimodal Large Language Models [PDF] [Code]
Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
We propose the Generalized Segmentation Vision Assistant (GSVA) to address the multi-target and empty-target cases in Generalized Referring Expression Segmentation (GRES).