SURE

Sunderland Repository records the research produced by the University of Sunderland including practice-based research and theses.

Disentangled Image-Text Classification: Enhancing Visual Representations with MLLM-driven Knowledge Transfer

Shuai, Qianjun, Chen, Xiaohao, Cheng, Yongqiang, Fang, Miao and Jin, Libiao (2026) Disentangled Image-Text Classification: Enhancing Visual Representations with MLLM-driven Knowledge Transfer. Expert Systems with Applications, 304. Article 130790.

Item Type: Article

Abstract

Multimodal image-text classification plays a critical role in applications such as content moderation, news recommendation, and multimedia understanding. Despite recent advances, the visual modality poses greater representation-learning complexity in semantic extraction than the textual modality, which often leads to a semantic gap between visual and textual representations. In addition, conventional fusion strategies introduce cross-modal redundancy, further limiting classification performance. To address these issues, we propose MD-MLLM, a novel image-text classification framework that leverages multimodal large language models (MLLMs) to generate semantically enhanced visual representations. To mitigate the redundancy introduced by direct MLLM feature integration, we introduce a hierarchical disentanglement mechanism based on the Hilbert-Schmidt Independence Criterion (HSIC) and orthogonality constraints, which explicitly separates modality-specific and shared representations. Furthermore, a hierarchical fusion strategy combines the original unimodal features with the disentangled shared semantics, promoting discriminative feature learning and cross-modal complementarity. Extensive experiments on two benchmark datasets, N24News and Food101, show that MD-MLLM achieves consistently stable improvements in classification accuracy and exhibits competitive performance compared with representative multimodal baselines. The framework also demonstrates good generalization and robustness across different multimodal scenarios. The code is available at https://github.com/xiaohaochen0308/MD-MLLM.
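As a rough illustration of the disentanglement mechanism summarised in the abstract, the sketch below shows how an HSIC penalty and a soft orthogonality constraint could be combined to push modality-specific representations apart from shared ones. This is a minimal PyTorch sketch under assumed tensor shapes, kernel choices, and loss weights, not the authors' implementation; the official code is at https://github.com/xiaohaochen0308/MD-MLLM.

```python
# Minimal sketch of HSIC-based disentanglement (illustrative only; shapes,
# kernels, and weights are assumptions, not the MD-MLLM implementation).
import torch
import torch.nn.functional as F

def hsic(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Biased empirical HSIC with linear kernels.

    x, y: (n, d) feature batches. HSIC approaches 0 as x and y become
    statistically independent, so minimising it disentangles them.
    """
    n = x.size(0)
    k = x @ x.t()                                # (n, n) kernel matrix for x
    l = y @ y.t()                                # (n, n) kernel matrix for y
    h = torch.eye(n, device=x.device) - 1.0 / n  # centering matrix H = I - (1/n)11^T
    return torch.trace(k @ h @ l @ h) / ((n - 1) ** 2)

def orthogonality(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality constraint: squared Frobenius norm of the
    cross-correlation between L2-normalised feature sets."""
    return (F.normalize(x, dim=-1).t() @ F.normalize(y, dim=-1)).pow(2).sum()

# Hypothetical usage with random stand-ins for encoder outputs.
n, d = 32, 256
img_specific, img_shared = torch.randn(n, d), torch.randn(n, d)
txt_specific, txt_shared = torch.randn(n, d), torch.randn(n, d)

disentangle_loss = (
    hsic(img_specific, img_shared)        # image-specific vs. shared
    + hsic(txt_specific, txt_shared)      # text-specific vs. shared
    + 0.1 * orthogonality(img_specific, txt_specific)  # weight is illustrative
)

# One plausible reading of the hierarchical fusion: concatenate the original
# unimodal features with the disentangled shared semantics before a classifier.
fused = torch.cat([img_specific, txt_specific, 0.5 * (img_shared + txt_shared)], dim=-1)
print(disentangle_loss.item(), fused.shape)
```

Minimising HSIC between the specific and shared branches drives them toward statistical independence, while the orthogonality term additionally decorrelates the two modality-specific spaces; the actual MD-MLLM loss terms and fusion hierarchy may differ from this sketch.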

PDF: MD_MLLM(eswa) (Clean author submitted).pdf - Accepted Version (15MB)
Restricted to Repository staff only until 18 December 2027.
Available under License Creative Commons Attribution Non-commercial No Derivatives.

More Information

Additional Information: The code is available at https://github.com/xiaohaochen0308/MD-MLLM.
Depositing User: Yongqiang Cheng

Identifiers

Item ID: 19756
Identification Number: 10.1016/j.eswa.2025.130790
URI: https://sure.sunderland.ac.uk/id/eprint/19756
Official URL: https://www.sciencedirect.com/science/article/abs/...

Users with ORCIDS

ORCID for Qianjun Shuai: orcid.org/0000-0003-0062-0077
ORCID for Xiaohao Chen: orcid.org/0009-0002-1219-6488
ORCID for Yongqiang Cheng: orcid.org/0000-0001-7282-7638

Catalogue record

Date Deposited: 23 Dec 2025 07:39
Last Modified: 23 Dec 2025 07:39

Contributors

Author: Qianjun Shuai
Author: Xiaohao Chen
Author: Yongqiang Cheng
Author: Miao Fang
Author: Libiao Jin

University Divisions

Faculty of Business and Technology > School of Computer Science and Engineering

Subjects

Computing > Data Science
Computing > Artificial Intelligence
