Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

1 Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology,
Beijing Institute of Technology
2 Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University
3 Department of Electrical and Computer Systems Engineering, Monash University

ICLR 2026 (Poster)

*Indicates Equal Contribution

Abstract

[Figure: motivation overview]

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures.
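As a rough illustration of the text-guided extraction step, the sketch below shows one way the coarse-to-fine mechanism could look: one textual cue per taxonomy level queries the class tokens collected from intermediate Transformer layers via cross-attention. Module and tensor names are hypothetical and this is not the released implementation.

```python
import torch
import torch.nn as nn

class SemanticAwareExtractor(nn.Module):
    """Sketch (hypothetical names): text-guided cross-attention over the
    visual class tokens gathered from intermediate Transformer layers."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, cls_tokens: torch.Tensor) -> torch.Tensor:
        # text_feats:  (B, L, D) -- one textual cue per taxonomy level, coarse -> fine
        # cls_tokens:  (B, N, D) -- class tokens taken from N intermediate layers
        # Each level's textual cue attends over the intermediate class tokens,
        # yielding one visual feature per level of the feature tree.
        out, _ = self.attn(query=text_feats, key=cls_tokens, value=cls_tokens)
        return out  # (B, L, D): coarse-to-fine, semantic-aware visual features
```

The output can then serve as the visual feature tree that mirrors the textual hierarchy, one node per taxonomic level.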

To align the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on the two manifolds and learn an intermediate manifold for alignment by minimizing this distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines in few-shot and cross-domain settings.
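To give a concrete feel for alignment on an intermediate manifold, here is a minimal sketch on the Poincaré ball: both modalities are mapped onto a shared ball whose curvature is learnable, and paired embeddings are pulled together along geodesics. This uses a simple geodesic-distance proxy, not the paper's KL distance between distributions, and all function names are illustrative.

```python
import torch

def expmap0(v, c):
    """Exponential map at the origin of a Poincare ball with curvature -c (c > 0)."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_dist(x, y, c):
    """Geodesic distance on the Poincare ball with curvature -c."""
    sqrt_c = c ** 0.5
    num = 2 * c * (x - y).pow(2).sum(-1)
    den = (1 - c * x.pow(2).sum(-1)) * (1 - c * y.pow(2).sum(-1))
    return (1 / sqrt_c) * torch.acosh((1 + num / den).clamp_min(1.0))

def align_loss(img_tangent, txt_tangent, log_c_mid):
    """Toy alignment objective (not the paper's KL formulation): embed both
    modalities on an intermediate manifold with learnable curvature
    exp(log_c_mid), then minimize geodesic distance between paired points."""
    c_mid = log_c_mid.exp()  # parameterize in log space to keep curvature positive
    x = expmap0(img_tangent, c_mid)
    y = expmap0(txt_tangent, c_mid)
    return poincare_dist(x, y, c_mid).mean()
```

Because `log_c_mid` is a plain tensor, it can be optimized jointly with the encoders, so the intermediate curvature is learned rather than fixed.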

Our Method

[Figure: pipeline overview]

Experimental Results

TOS (taxonomic open-set) classification results in the 1-shot and 16-shot settings. Each cell reports LA / HCA / MTA (%).

| K-Shot | Base      | Variant  | CIFAR-100             | SUN                   | ImageNet              | Rare Species          |
|--------|-----------|----------|-----------------------|-----------------------|-----------------------|-----------------------|
| 1      | MaPLe     | Vanilla  | 68.75 / 4.65 / 50.60  | 63.98 / 25.15 / 50.31 | 68.91 / 2.97 / 48.16  | 41.55 / 5.09 / 44.75  |
| 1      | MaPLe     | +ProTeCt | 69.33 / 48.10 / 83.36 | 64.29 / 50.45 / 76.73 | 66.16 / 20.44 / 85.18 | 39.92 / 13.22 / 70.04 |
| 1      | MaPLe     | +Ours    | 71.37 / 53.19 / 85.29 | 67.57 / 57.92 / 80.55 | 66.33 / 25.56 / 85.98 | 46.77 / 20.94 / 76.83 |
| 1      | PromptSRC | Vanilla  | 72.48 / 14.36 / 51.91 | 70.58 / 42.14 / 57.19 | 68.82 / 4.46 / 54.10  | 45.39 / 6.72 / 44.72  |
| 1      | PromptSRC | +ProTeCt | 73.07 / 49.54 / 85.16 | 70.61 / 55.52 / 78.73 | 68.43 / 21.58 / 85.63 | 44.56 / 20.36 / 74.42 |
| 1      | PromptSRC | +Ours    | 73.54 / 51.91 / 85.76 | 70.64 / 57.79 / 79.94 | 68.86 / 25.13 / 86.45 | 46.98 / 23.03 / 77.32 |
| 16     | MaPLe     | Vanilla  | 75.01 / 17.54 / 52.21 | 71.86 / 33.25 / 54.29 | 70.70 / 4.15 / 48.16  | 50.94 / 5.30 / 40.41  |
| 16     | MaPLe     | +ProTeCt | 75.34 / 61.15 / 88.04 | 72.17 / 59.71 / 82.27 | 69.52 / 31.24 / 87.87 | 48.14 / 24.82 / 78.79 |
| 16     | MaPLe     | +Ours    | 77.92 / 69.38 / 90.89 | 75.47 / 68.67 / 86.02 | 71.41 / 43.79 / 88.78 | 69.96 / 53.65 / 87.27 |
| 16     | PromptSRC | Vanilla  | 77.71 / 15.07 / 56.86 | 75.75 / 45.23 / 59.42 | 71.50 / 2.48 / 46.71  | 59.20 / 11.64 / 55.82 |
| 16     | PromptSRC | +ProTeCt | 78.76 / 66.74 / 90.79 | 75.54 / 66.01 / 84.75 | 70.98 / 32.89 / 88.31 | 56.40 / 33.92 / 82.47 |
| 16     | PromptSRC | +Ours    | 78.90 / 68.47 / 91.12 | 76.54 / 69.18 / 86.20 | 71.67 / 42.26 / 89.64 | 67.38 / 50.77 / 87.60 |

Learned Representations

Our method demonstrates improved feature separability across taxonomic categories.

[Figure: learned feature representations across taxonomic categories]

Attention Visualization

Our model adaptively generates semantic-aware visual features by attending to different regions corresponding to each taxonomic granularity (from coarse to fine, left to right).

Baseline

[Figure: baseline attention maps across taxonomic granularities]

Ours

[Figure: our semantic-aware attention maps across taxonomic granularities]

Citation & BibTeX

If you find our work useful, please cite:

@inproceedings{wu2026hypmodalalign,
  title={Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds},
  author={Wu, Wei and Fan, Xiaomeng and Wu, Yuwei and Gao, Zhi and Li, Pengxiang and Jia, Yunde and Harandi, Mehrtash},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}