A Reliable Multimodal Method Considering Modality-Specific Subspace Learning

Abstract

Representation learning is critical for multimodal methods. Traditional consistency-based multimodal methods typically penalize disagreement among the embeddings or predictions of different modalities as an extra regularization term. However, these methods can degrade performance in open environments. This is mainly attributed to the interference of asymmetric information: the information carried by different modalities diverges, whereas consistency regularization simply minimizes that divergence rather than learning optimal classifiers. It is therefore unsafe to apply consistency regularization directly. To this end, we propose modality-specific subspace learning (MSSL), which learns modality-specific subspace representations by treating modality divergence and consistency separately. In particular, MSSL is a semi-supervised framework that maps the feature embeddings of different modalities into shared and independent subspaces. The shared subspace applies reliable consistency regularization by measuring intermodality structural similarities. The independent subspace uses a discriminative modality-separation network to emphasize complementary modality information. Finally, labeled instances from different modalities are classified with weighted predictions over concatenated embeddings. Consequently, MSSL improves both the single-modality and ensemble classification results and learns a more robust mapping among different modalities. Empirical studies show the superior performance of MSSL on real-world datasets.
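The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear projections, the cosine-similarity form of the structural consistency term, the fixed ensemble weight, and every name below are assumptions made for illustration, and the discriminative modality-separation network is omitted entirely.

```python
import numpy as np


def cosine_similarity_matrix(z):
    """Pairwise cosine similarities between the rows of z."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    unit = z / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def structural_consistency_loss(shared_a, shared_b):
    """Mean squared gap between two instance-similarity structures.

    Assumed stand-in for "measuring intermodality structural
    similarities"; the paper's actual loss may differ.
    """
    gap = cosine_similarity_matrix(shared_a) - cosine_similarity_matrix(shared_b)
    return float(np.mean(gap ** 2))


def softmax(logits):
    """Row-wise softmax with the usual max-shift for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def weighted_prediction(probs_a, probs_b, weight_a=0.5):
    """Convex combination of two modality-level class distributions."""
    return weight_a * probs_a + (1.0 - weight_a) * probs_b


# Toy data: 8 instances, two modalities with different feature sizes.
rng = np.random.default_rng(0)
n, dim_a, dim_b, sub_dim, n_classes = 8, 16, 12, 4, 3
x_a = rng.normal(size=(n, dim_a))
x_b = rng.normal(size=(n, dim_b))

# Hypothetical linear maps into the shared and independent subspaces.
shared_a = x_a @ rng.normal(size=(dim_a, sub_dim))
shared_b = x_b @ rng.normal(size=(dim_b, sub_dim))
private_a = x_a @ rng.normal(size=(dim_a, sub_dim))
private_b = x_b @ rng.normal(size=(dim_b, sub_dim))

# Consistency is enforced only on the shared subspace.
consistency = structural_consistency_loss(shared_a, shared_b)

# Each modality classifies its concatenated [shared, independent] embedding.
z_a = np.concatenate([shared_a, private_a], axis=1)
z_b = np.concatenate([shared_b, private_b], axis=1)
probs_a = softmax(z_a @ rng.normal(size=(2 * sub_dim, n_classes)))
probs_b = softmax(z_b @ rng.normal(size=(2 * sub_dim, n_classes)))

# Assumed fixed weight; a learned or confidence-based weight is also possible.
ensemble = weighted_prediction(probs_a, probs_b, weight_a=0.6)
```

Comparing similarity matrices rather than raw embeddings constrains only the relational structure among instances, which is one plausible way to regularize consistency without forcing the two modalities to produce identical embeddings.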

Keywords

multimodal learning, semi-supervised learning, subspace learning, reliable regularization

