As generative models grow larger and handle more complex inputs across modalities such as language, vision, and video, token-level computation becomes a significant bottleneck. Existing token selection methods are often static, modality-specific, or incompatible with autoregressive generation, which limits their effectiveness.

QuickMerge++ addresses these challenges with a lightweight token merging framework that dynamically selects a reduced set of tokens based on attention norm magnitudes, using an entropy-based budget estimator to decide how many tokens to keep. To remain compatible with autoregressive generation, the method adds a lightweight transformer prior trained on the merged token sequences. Combining semantic salience estimation, flexible token budgets, and autoregressive alignment lets QuickMerge++ produce accurate outputs while processing far fewer tokens.

Evaluations across multiple modalities show that QuickMerge++ consistently improves the tradeoff between computational cost and accuracy: it substantially reduces token counts and outperforms both learned tokenizers and fixed-patch baselines. The approach offers a promising direction for scaling generative models more efficiently without sacrificing output quality, with implications for faster inference and lower resource consumption in large-scale generative tasks. Future work could explore further refinement of the token budget estimator and broader applicability to other model architectures.
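To make the selection-and-merge idea concrete, here is a minimal PyTorch sketch of the general approach described above: score tokens by the norm of the attention they receive, derive a token budget from the entropy of that salience distribution, and pool the discarded tokens into the nearest kept token. This is not the authors' implementation; the function names (`entropy_budget`, `quickmerge_like`), the exact salience and budget formulas, and the mean-pooling merge rule are all assumptions for illustration, and the transformer prior for autoregressive alignment is omitted.

```python
# Minimal sketch (not the paper's code): attention-norm salience,
# entropy-based token budget, and merge-by-nearest-kept-token.
import torch
import torch.nn.functional as F

def entropy_budget(scores: torch.Tensor, n_min: int = 8) -> int:
    """Hypothetical budget rule: higher entropy in the salience
    distribution -> keep more tokens."""
    p = F.softmax(scores, dim=-1)
    h = -(p * p.clamp_min(1e-12).log()).sum()                      # Shannon entropy
    frac = (h / torch.log(torch.tensor(float(p.numel())))).item()  # normalize to [0, 1]
    return min(p.numel(), max(n_min, int(frac * p.numel())))

def quickmerge_like(tokens: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    """tokens: (N, d) token embeddings; attn: (H, N, N) attention weights
    from one layer. Returns a shorter (k, d) merged sequence."""
    # Salience = norm of the attention each token receives, averaged over heads.
    scores = attn.mean(dim=0).norm(dim=0)                 # (N,)
    k = entropy_budget(scores)
    keep = scores.topk(k).indices.sort().values           # keep original order
    kept = tokens[keep].clone()
    # Merge discarded tokens into their most similar kept token (mean pooling).
    drop = [i for i in range(tokens.size(0)) if i not in set(keep.tolist())]
    if drop:
        drop = torch.tensor(drop)
        sim = F.normalize(tokens[drop], dim=-1) @ F.normalize(kept, dim=-1).T
        assign = sim.argmax(dim=-1)                        # nearest kept token per dropped token
        for j in range(kept.size(0)):
            group = drop[assign == j]
            if group.numel() > 0:
                kept[j] = torch.cat([kept[j:j+1], tokens[group]]).mean(dim=0)
    return kept

# Example usage with random data: 64 tokens, 4 heads, 32-dim embeddings.
tokens = torch.randn(64, 32)
attn = torch.softmax(torch.randn(4, 64, 64), dim=-1)
merged = quickmerge_like(tokens, attn)
print(merged.shape)   # (k, 32) with k chosen by the entropy budget
```

In this sketch the budget adapts per input: a flat salience distribution (high entropy) keeps more tokens, while a peaked one keeps fewer, which matches the flexible-budget behavior the summary describes.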
👉 Read the original: arXiv AI Papers