UniPruneBench

Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Introduction

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across 6 ability dimensions and 10 datasets, covering 10 representative compression algorithms and 3 families of LMMs (LLaVA-v1.5, InternVL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefill latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms the others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being the most vulnerable, and (4) the pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.

UniPruneBench

Overview of UniPruneBench, along with experimental results for representative pruning methods across various data scenarios.

Figure 2 presents the UniPruneBench taxonomy of visual token pruning methods, categorized into ViT-only, LLM-only, and Hybrid approaches.
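To make the taxonomy concrete, the sketch below shows, with a toy model and a placeholder importance score that are purely illustrative assumptions (not code from the benchmark), where each category intervenes in an LMM pipeline: ViT-only methods drop visual tokens right after the image encoder, LLM-only methods drop them at an intermediate language-model layer, and hybrid methods combine both hooks.

import torch
import torch.nn as nn

class ToyLMM(nn.Module):
    """Minimal stand-in for an LMM pipeline, used only to show where pruning hooks attach."""
    def __init__(self, dim=64, num_layers=4):
        super().__init__()
        self.vision_encoder = nn.Linear(dim, dim)                       # stand-in for the ViT
        self.llm_layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))

    def forward(self, image_feats, vit_prune=None, llm_prune=None, llm_prune_layer=2):
        tokens = self.vision_encoder(image_feats)
        if vit_prune is not None:          # ViT-only: tokens are dropped before the LLM sees them
            tokens = vit_prune(tokens)
        for i, layer in enumerate(self.llm_layers):
            if llm_prune is not None and i == llm_prune_layer:
                tokens = llm_prune(tokens) # LLM-only: tokens are dropped at an intermediate layer
            tokens = layer(tokens)
        return tokens                      # hybrid methods combine both hooks

def topk_by_norm(tokens, k=16):
    """Placeholder importance score: keep the k highest-norm tokens in their original order."""
    idx = tokens.norm(dim=-1).topk(k, dim=1).indices.sort(dim=1).values
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

model = ToyLMM()
out = model(torch.randn(1, 64, 64), vit_prune=topk_by_norm)  # ViT-only pruning
print(out.shape)  # torch.Size([1, 16, 64])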

Results and Findings

Random pruning consistently outperforms several well-designed methods, such as FitPrune, GPrune, VTW, and PruMerge. For instance, on InternVL3-8B and Qwen2.5-VL-7B, FitPrune performs worse than random pruning across all pruning ratios, and on LLaVA-v1.5-7B, six of the eight evaluated methods perform worse than random pruning at the 66.7% and 77.8% pruning ratios. This unexpected result highlights the limitations of current designs and suggests that pruning strategies which reliably beat naive baselines are still needed.
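For reference, the snippet below is a minimal sketch of how a random-pruning baseline is commonly implemented; the function name, tensor shapes, and hidden size are illustrative assumptions rather than UniPruneBench's code. It keeps a uniformly random subset of visual tokens whose size is set by the pruning ratio.

import torch

def random_prune(visual_tokens: torch.Tensor, prune_ratio: float) -> torch.Tensor:
    """Keep a uniformly random subset of visual tokens.

    visual_tokens: (batch, num_tokens, hidden_dim) output of the image encoder.
    prune_ratio:   fraction of tokens to drop, e.g. 0.889 keeps roughly 11% of them.
    """
    batch, num_tokens, hidden_dim = visual_tokens.shape
    num_keep = max(1, round(num_tokens * (1.0 - prune_ratio)))
    # Draw a random permutation per sample, keep the first num_keep indices, and sort them
    # so the surviving tokens stay in their original spatial order.
    perm = torch.rand(batch, num_tokens, device=visual_tokens.device).argsort(dim=1)
    keep_idx = perm[:, :num_keep].sort(dim=1).values
    return torch.gather(visual_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, hidden_dim))

# Example: 576 visual tokens pruned at 88.9% leaves 64 tokens.
print(random_prune(torch.randn(1, 576, 4096), 0.889).shape)  # torch.Size([1, 64, 4096])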

No approach dominates across all models and pruning ratios. DivPrune achieves the best results on both Qwen2.5-VL-7B and InternVL3-8B under all ratios. However, on LLaVA-v1.5-7B, SparseVLM surpasses DivPrune under light pruning ratios, while DivPrune regains superiority under more aggressive pruning. This indicates that performance strongly depends on both the model architecture and the pruning level.

Among the three categories of methods, hybrid approaches achieve the best results on LLaVA-v1.5-7B at the 66.7% and 77.8% pruning ratios, though they fall behind at the 88.9% ratio. On InternVL3-8B and Qwen2.5-VL-7B, ViT-only methods (e.g., DivPrune) consistently outperform LLM-only methods (e.g., FitPrune), suggesting that vision-side pruning is more effective than language-side pruning.

Most benchmarks show accuracy degradation as pruning intensifies. However, instruction-following tasks (e.g., MIA) exhibit improvements in some cases. For example, on InternVL3-8B, DivPrune raises accuracy from 72.22% to 79.82%. We hypothesize that pruning increases the relative weight of textual inputs, thereby enhancing instruction adherence. In contrast, OCR tasks are highly sensitive to pruning: as more visual tokens are removed, crucial details are lost, leading to rapid performance decline.

Light pruning leads to moderate degradation, while aggressive pruning causes substantial drops. For example, on Qwen2.5-VL-7B, the average accuracy decreases from 57.5% at 33% tokens to 50.1% at 11% tokens under random pruning. Similarly, on InternVL3-8B, DivPrune maintains 67.58% at 22% tokens but falls to 64.04% at 11% tokens. Notably, DivPrune consistently achieves the best results under the highest pruning ratio (88.9%), showing stronger robustness in extreme scenarios.
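For reference, the correspondence between the pruning ratios (66.7%, 77.8%, 88.9%) and the retained-token percentages quoted above (33%, 22%, 11%) can be made explicit with a small helper; the 576-token budget in the example is the visual token count of LLaVA-v1.5 and is included only for illustration.

def retained_tokens(total_tokens: int, prune_ratio: float) -> int:
    """Number of visual tokens kept after pruning at the given ratio."""
    return round(total_tokens * (1.0 - prune_ratio))

for ratio in (0.667, 0.778, 0.889):
    kept = retained_tokens(576, ratio)  # 576 = visual tokens in LLaVA-v1.5
    print(f"prune {ratio:.1%} -> keep {kept} tokens ({kept / 576:.1%})")
# prune 66.7% -> keep 192 tokens (33.3%)
# prune 77.8% -> keep 128 tokens (22.2%)
# prune 88.9% -> keep 64 tokens (11.1%)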

Analysis of Exploratory Results

To investigate the sensitivity of token compression techniques to model scale, we evaluate three representative methods, DivPrune, GPrune, and FastV, on two variants of InternVL: InternVL3-1B (small) and InternVL3-8B (large). As shown in Figure 3, scaling up the base model consistently yields significant accuracy gains across nearly all benchmarks and all compression methods, confirming that larger models retain more semantic capacity even after token reduction. Larger architectures thus provide greater robustness to token pruning, suggesting that compression strategies should be evaluated across model scales rather than in isolation.

Considering real-world deployment, we also evaluate the running time of different pruning methods. We profile three nested intervals: Total time, the elapsed time to finish the entire dataset; Prefill time, the single encoder forward pass that computes keys and values for all visual and textual tokens before any decoding starts, a phase that is compute-bound for large models; and Method time, the GPU milliseconds spent only on the compression subroutine (token scoring, selection, and tensor re-layout). All measurements were collected on an NVIDIA A100 40GB GPU with batch size 1, over three independent runs, and all reported methods use a uniform pruning rate of 88.9% on the MME benchmark. The results in Table 4 show that Method time never exceeds 0.5 s, less than 0.12% of the corresponding total, so the cost of importance estimation is negligible. Pruning therefore exerts its effect entirely within the prefill: DivPrune and GPrune shorten it from 320 s to 185 s and 167 s, respectively, delivering 1.73–1.92× encoder acceleration and an overall 1.62–1.68× end-to-end speed-up over the vanilla model.
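This measurement pattern can be reproduced with a small helper like the one below; the method and prefill stand-in functions are toy placeholders used only to show where the timers go, not the hooks actually used in UniPruneBench.

import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds), synchronizing the GPU if one is in use."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure all launched kernels finish before stopping the clock
    return result, time.perf_counter() - start

# Toy stand-ins for two of the nested intervals (real hooks wrap the LMM's own code;
# Total time is simply this loop accumulated over the whole dataset).
def method_step(tokens):   # compression subroutine: token scoring, selection, tensor re-layout
    idx = tokens.norm(dim=-1).topk(64, dim=1).indices.sort(dim=1).values
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

def prefill_step(tokens):  # stand-in for the forward pass that builds the KV cache
    return tokens @ tokens.transpose(1, 2)

tokens = torch.randn(1, 576, 4096)
pruned, t_method = timed(method_step, tokens)
_, t_prefill = timed(prefill_step, pruned)
print(f"method {t_method * 1e3:.2f} ms | prefill {t_prefill * 1e3:.2f} ms")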

Case Studies

BibTeX

@article{peng2025can,
  title={Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models},
  author={Peng, Tianfan and Du, Yuntao and Ji, Pengzhou and Dong, Shijie and Jiang, Kailin and Ma, Mingchuan and Tian, Yijun and Bi, Jinhe and Li, Qian and Du, Wei and others},
  journal={arXiv preprint arXiv:2511.02650},
  year={2025}
}