From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Yuchen Xian¹,², Yunqiu Xu³, Yang He⁴,⁵, Yi Yang¹,²,*

¹ReLER, The State Key Lab of Brain Machine Intelligence, Zhejiang University
²College of Artificial Intelligence, Zhejiang University
³National University of Singapore
⁴CFAR, Agency for Science, Technology and Research, Singapore
⁵IHPC, Agency for Science, Technology and Research, Singapore
*Corresponding author

ICML 2026


Abstract

Multimodal image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image that preserves fine local details while maintaining a globally consistent appearance. Most existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level appearance factors. To better optimize these two objectives jointly, we redesign the shared representation by mapping inputs into a compact sequence of discrete 1D image tokens. We instantiate this design with TiTok as a lightweight tokenizer, which decouples the shared representation from fixed pixel locations and concentrates image-level attributes into a small set of global tokens. We further propose Selective Token Editing (STE), which sparsely updates only a small set of critical shared tokens. STE provides a lightweight token-level mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding complex loss designs. Experiments on multiple benchmarks show that our method delivers consistent, multi-metric improvements, enhancing global coherence and local fidelity simultaneously, and achieves the best overall performance under comprehensive evaluation.

Method Overview

**Figure 2.** Overview of our two-stage framework. Stage I establishes base/detail factorization via reconstruction warm-up. Stage II applies Selective Token Editing (STE) to steer global appearance during fusion training.
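
To make the idea of a sparse token update concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes that each shared token is represented by a continuous embedding vector (the actual tokens are discrete), that a per-token importance score is available, and that the "critical" tokens are simply the `k` highest-scoring ones. The function name `selective_token_edit`, the scoring rule, and all shapes are hypothetical.

```python
import random


def selective_token_edit(tokens, scores, deltas, k):
    """Apply an update to only the k highest-scoring tokens.

    tokens: list of K embedding vectors (hypothetical stand-in for shared 1D tokens)
    scores: list of K importance scores (hypothetical selection criterion)
    deltas: list of K proposed per-token updates, same shape as tokens
    k:      number of tokens that are allowed to change
    """
    # Assumption: "critical" tokens are the top-k by score.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])

    # Edited tokens receive their update; all others are copied unchanged.
    edited = [
        [t + d for t, d in zip(tok, dlt)] if i in chosen else list(tok)
        for i, (tok, dlt) in enumerate(zip(tokens, deltas))
    ]
    return edited, chosen


random.seed(0)
K, D = 32, 16  # hypothetical sequence length and embedding width
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
scores = [random.random() for _ in range(K)]
deltas = [[0.1 * random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

edited, chosen = selective_token_edit(tokens, scores, deltas, k=4)
print(f"{len(chosen)} of {K} tokens edited")  # prints "4 of 32 tokens edited"
```

In this sketch the remaining 28 tokens pass through untouched, which mirrors the stated goal of steering global appearance through a small set of tokens while leaving the rest of the representation, and the fusion backbone, unchanged.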

Experiments

We evaluate on two multimodal image fusion tasks and two downstream perception tasks:

Infrared-Visible Image Fusion (IVIF): Trained on MSRS (1,083 pairs); tested on M3FD (202 pairs), RoadScene (152 pairs), and TNO (30 pairs).
Medical Image Fusion (MIF): Evaluated on the Harvard Medical dataset.
Downstream tasks: Object detection on M3FD and semantic segmentation on FMB.

We compare against state-of-the-art methods including CDDFuse, SAGE, EMMA, DCEvo, and Text-DiFuse. Our method achieves the best overall performance under comprehensive multi-metric evaluation, delivering consistent improvements in global coherence and local fidelity simultaneously.

**Figure 3.** Qualitative comparisons across infrared-visible and medical image fusion examples. Our method better preserves salient thermal targets, visible textures, and anatomical response details while maintaining natural global appearance compared to competing methods.

**Figure 4.** Downstream task evaluation. Our fused images improve object detection (M3FD) and semantic segmentation (FMB) performance over competing methods.

BibTeX

@inproceedings{xian2026_1dtokens,
  title     = {From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion},
  author    = {Xian, Yuchen and Xu, Yunqiu and He, Yang and Yang, Yi},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}