Latency-aware Unified Dynamic Networks for Efficient Image Recognition [PDF] [Code] [将门创投]
Yizeng Han*, Zeyu Liu*, Zhihang Yuan*, Yifan Pu, Chaofei Wang, Shiji Song, Gao Huang
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI, IF=24.314), 2024
We propose Latency-aware Unified Dynamic Networks (LAUDNet), a comprehensive framework that amalgamates three cornerstone dynamic paradigms—spatially-adaptive computation, dynamic layer skipping, and dynamic channel skipping—under a unified formulation.
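To make the unified formulation concrete, below is a minimal PyTorch sketch (illustrative only, not the released LAUDNet code; the UnifiedDynamicBlock name, the simple 1x1 maskers, and the hard thresholds are assumptions for this example). It shows how one residual block could combine a layer-skip gate, per-channel gates, and a coarse patch-level spatial mask; inactive positions are merely zeroed out here, whereas actual latency savings require computing only the gathered active patches and channels, which is what the paper's latency-aware scheduling addresses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedDynamicBlock(nn.Module):
    """Illustrative residual block combining the three dynamic paradigms:
    layer skipping, channel skipping, and patch-level spatial skipping."""
    def __init__(self, channels, patch_size=4):
        super().__init__()
        self.patch_size = patch_size
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        # global pooling -> one logit for layer skipping + one logit per channel
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, 1 + channels, 1))
        # coarse, patch-level spatial masker
        self.spatial = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.gate(x)                                   # (b, 1 + c, 1, 1)
        layer_keep = (g[:, :1] > 0).float()                # keep or skip the whole block
        channel_keep = (g[:, 1:] > 0).float()              # per-channel gates
        coarse = F.adaptive_avg_pool2d(x, (h // self.patch_size, w // self.patch_size))
        spatial_keep = (self.spatial(coarse) > 0).float()  # (b, 1, h/p, w/p)
        spatial_keep = F.interpolate(spatial_keep, size=(h, w), mode="nearest")
        # zeroing stands in for skipping; real savings need sparse/gathered compute
        return x + layer_keep * channel_keep * spatial_keep * self.conv(x)

# toy usage
block = UnifiedDynamicBlock(channels=64)
print(block(torch.randn(2, 64, 56, 56)).shape)  # torch.Size([2, 64, 56, 56])
```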
Dynamic Diffusion Transformer [PDF]
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You
arXiv preprint, 2024
We propose to dynamically adjust the computation of DiT across diffusion timesteps and spatial locations of the image. The computation of DiT-XL can be reduced by 50% without sacrificing generation quality.
Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [PDF] [Code]
Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You
Conference on Neural Information Processing Systems (NeurIPS), 2024
We propose to adapt a static ViT into a dynamic ViT via parameter-efficient fine-tuning, without tuning the full set of parameters.
Demystify Mamba in Vision: A Linear Attention Perspective [PDF] [Code]
Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
Conference on Neural Information Processing Systems (NeurIPS), 2024
By exploring the similarities and disparities between the effective Mamba and the subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba's success. Based on these findings, we propose a Mamba-Like Linear Attention (MLLA) model.
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [PDF]
Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li
European Conference on Computer Vision (ECCV), 2024
We present a novel diffusion transformer framework that incorporates an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens across the denoising steps, our model starts the denoising process from a precise, unambiguous stage and gradually transitions to a phase enriched with detail.
Agent Attention: On the Integration of Softmax and Linear Attention [PDF] [Code]
Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Shiji Song, Gao Huang
European Conference on Computer Vision (ECCV), 2024
We propose Agent Attention, a linear attention mechanism for visual recognition and generation. Notably, agent attention shows remarkable performance in high-resolution scenarios, owing to its linear complexity.
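As a quick illustration of the mechanism, here is a minimal single-head PyTorch sketch of the two-stage computation (agents first aggregate keys and values with softmax attention, then queries attend to the agents). The agent_attention name and the pooled-query agents in the toy usage are assumptions for this example; multi-head handling and other components of the released implementation are omitted.

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, agent):
    """Single-head agent attention: two softmax attentions, each linear in n.

    q, k, v : (batch, n, dim)  query/key/value tokens
    agent   : (batch, m, dim)  m << n agent tokens
    """
    scale = q.shape[-1] ** -0.5
    # 1) agent aggregation: agents attend to all keys and gather the values
    agent_v = F.softmax(agent @ k.transpose(-2, -1) * scale, dim=-1) @ v     # (b, m, dim)
    # 2) agent broadcast: queries attend to the m agents to read the values back
    return F.softmax(q @ agent.transpose(-2, -1) * scale, dim=-1) @ agent_v  # (b, n, dim)

# toy usage: pooled queries serve as agent tokens (an assumption for this sketch)
b, n, m, d = 2, 196, 49, 64
q, k, v = (torch.randn(b, n, d) for _ in range(3))
agent = F.adaptive_avg_pool1d(q.transpose(1, 2), m).transpose(1, 2)          # (b, m, d)
print(agent_attention(q, k, v, agent).shape)                                 # torch.Size([2, 196, 64])
```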
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection [PDF] [Code]
Le Yang, Ziwei Zheng, Yizeng Han, Hao Cheng, Shiji Song, Gao Huang, Fan Li
European Conference on Computer Vision (ECCV), 2024
In the temporal action detection (TAD) task, we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations.
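The following is a generic, self-contained PyTorch sketch of timestamp-adaptive aggregation, not the DyFADet implementation (the TimestampAdaptiveAggregation name, the dilated-branch bank, and the gating layout are assumptions for this example): a per-timestamp gate mixes several dilated temporal convolutions and re-weights channels, so the effective kernel and receptive field vary along the time axis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimestampAdaptiveAggregation(nn.Module):
    """Generic timestamp-adaptive temporal aggregation (illustration only)."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        # a bank of dilated temporal convolutions provides several receptive fields
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, 3, padding=d, dilation=d) for d in dilations
        )
        # per-timestamp gate: branch-mixing logits + per-channel modulation logits
        self.gate = nn.Conv1d(channels, len(dilations) + channels, 1)

    def forward(self, x):                                   # x: (batch, channels, T)
        g = self.gate(x)
        n = len(self.branches)
        branch_w = F.softmax(g[:, :n], dim=1)               # (b, n_branch, T)
        channel_w = torch.sigmoid(g[:, n:])                 # (b, channels, T)
        mixed = sum(branch_w[:, i:i + 1] * conv(x) for i, conv in enumerate(self.branches))
        return x + channel_w * mixed                        # residual aggregation

# toy usage on a clip-level feature sequence
feat = torch.randn(2, 256, 128)                             # (batch, channels, timestamps)
print(TimestampAdaptiveAggregation(256)(feat).shape)        # torch.Size([2, 256, 128])
```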
SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [PDF] [Code]
Chaoqun Du*, Yizeng Han*, Gao Huang
International Conference on Machine Learning (ICML), 2024
We focus on a realistic yet challenging task: addressing imbalances in labeled data while the class distribution of unlabeled data is unknown and mismatched. The proposed SimPro does not rely on any predefined assumptions about the distribution of unlabeled data.
GSVA: Generalized Segmentation via Multimodal Large Language Models [PDF] [Code]
Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
We propose the Generalized Segmentation Vision Assistant (GSVA) to address the multiple-target and empty-target challenges in Generalized Referring Expression Segmentation (GRES).