From 2D Grids to 1D Tokens:
Reforming Shared Representations for Multimodal Image Fusion

1ReLER, The State Key Lab of Brain Machine Intelligence, Zhejiang University
2College of Artificial Intelligence, Zhejiang University
3National University of Singapore
4CFAR, Agency for Science, Technology and Research, Singapore
5IHPC, Agency for Science, Technology and Research, Singapore

*Indicates Corresponding Author
ICML 2026

Abstract

Multimodal image fusion (MMIF) aims to integrate complementary information from different modalities into a single fused image that preserves fine local details while maintaining a globally consistent appearance. Most existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level appearance factors. To optimize both objectives jointly, we redesign the shared representation by mapping inputs into a compact sequence of discrete 1D image tokens. We instantiate this design with TiTok as a lightweight tokenizer, which decouples the shared representation from fixed pixel locations and concentrates image-level attributes into a small set of global tokens.
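
To make the idea of a compact, discrete 1D token sequence concrete, here is a minimal sketch. It is not TiTok's code or API: the nearest-codebook lookup is a generic stand-in for how discrete tokens are commonly produced, and the function name `quantize_to_tokens`, the random "latents", and the sizes `K`, `D`, `V` are all illustrative assumptions.

```python
import numpy as np


def quantize_to_tokens(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (K, D) array, K latent vectors for one image
              (hypothetical encoder output, assumed for illustration)
    codebook: (V, D) array of V code vectors
    returns:  (K,) integer array of token ids in [0, V)
    """
    # Squared Euclidean distance between every latent and every code: (K, V)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
K, D, V = 32, 16, 512  # illustrative sizes, not taken from the paper
codebook = rng.normal(size=(V, D))
latents = rng.normal(size=(K, D))

tokens = quantize_to_tokens(latents, codebook)

# A 2D grid keeps one feature per spatial cell, so its size follows the image
# resolution. The 1D sequence always has K entries, and no entry is tied to a
# fixed pixel location.
print(tokens.shape)
```

The point of the sketch is only the shape of the representation: a short vector of integer ids instead of a spatial feature map.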

Building on this representation, we propose Selective Token Editing (STE), which sparsely updates only a small set of critical shared tokens. This provides a lightweight token-level mechanism for steering global appearance coherence while keeping the fusion backbone unchanged and avoiding complex loss designs. Experiments on multiple benchmarks show that our method delivers consistent multi-metric improvements, enhancing global coherence and local fidelity simultaneously, and achieves the best overall performance under comprehensive evaluation.
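
The sparse-update idea can be sketched as follows. This page does not say how critical tokens are identified or how their new values are produced, so the sketch takes both as inputs: `scores` (a hypothetical per-token importance) and `proposed` (hypothetical replacement tokens). The function name `selective_token_edit` and the top-k selection rule are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def selective_token_edit(tokens, proposed, scores, k):
    """Replace only the k highest-scoring positions of a 1D token sequence.

    tokens:   (K,) current shared tokens
    proposed: (K,) candidate replacement tokens (hypothetical input)
    scores:   (K,) per-token importance (hypothetical input)
    k:        number of positions to edit, 0 <= k <= K
    returns:  (edited tokens, sorted indices of the edited positions)
    """
    tokens = np.asarray(tokens)
    proposed = np.asarray(proposed)
    scores = np.asarray(scores)
    if tokens.ndim != 1 or not (tokens.shape == proposed.shape == scores.shape):
        raise ValueError("tokens, proposed and scores must be 1D with equal length")
    if not 0 <= k <= tokens.size:
        raise ValueError("k must be between 0 and the number of tokens")

    edited = tokens.copy()
    if k == 0:
        return edited, np.array([], dtype=int)

    # Indices of the k largest scores; their relative order does not matter.
    idx = np.argpartition(scores, -k)[-k:]
    edited[idx] = proposed[idx]
    return edited, np.sort(idx)


tokens = np.array([5, 9, 2, 7, 1, 3])
proposed = np.array([8, 9, 4, 0, 6, 3])
scores = np.array([0.10, 0.90, 0.30, 0.80, 0.20, 0.05])

edited, changed = selective_token_edit(tokens, proposed, scores, k=2)
print(edited)   # [5 9 2 0 1 3]
print(changed)  # [1 3]
```

All positions outside the selected set pass through untouched, which is what keeps the edit sparse.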

Method Overview

Figure 2. Overview of our two-stage framework. Stage I establishes base/detail factorization via reconstruction warm-up. Stage II applies Selective Token Editing (STE) to steer global appearance during fusion training.

Experiments

We evaluate on two multimodal image fusion tasks and two downstream perception tasks:

  • Infrared-Visible Image Fusion (IVIF): Trained on MSRS (1,083 pairs); tested on M3FD (202 pairs), RoadScene (152 pairs), and TNO (30 pairs).
  • Medical Image Fusion (MIF): Evaluated on the Harvard Medical dataset.
  • Downstream tasks: Object detection on M3FD and semantic segmentation on FMB.

We compare against nine state-of-the-art methods, including CDDFuse, DDFM, LRRNet, Text-IF, TC-MoA, EMMA, and SAGE. Our method achieves the best overall performance under comprehensive multi-metric evaluation, improving global coherence and local fidelity simultaneously.

Figure 3. Qualitative comparisons on the M3FD, RoadScene, and TNO datasets. Our method produces fused images with enhanced global coherence and sharper local structures compared to existing methods.

Figure 4. Downstream task evaluation. Our fused images improve object detection (M3FD) and semantic segmentation (FMB) performance over competing methods.

BibTeX

@inproceedings{xian2026_1dtokens,
  title     = {From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion},
  author    = {Xian, Yuchen and Xu, Yunqiu and He, Yang and Yang, Yi},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}