Abstract:
In recent years, deep learning for multimodal interaction has attracted extensive research attention, with multimodal pretraining models playing an indispensable role. However, experiments show that most of these large models perform poorly in single-modality scenarios, require training on large volumes of aligned multimodal corpora that are difficult to obtain, and have too many parameters to deploy easily. This paper therefore proposes MIBERT, a lightweight modality co-encoder that requires no aligned multimodal corpora and targets single-modality scenarios. To train MIBERT, a knowledge distillation method, MJ-KD, is designed: the pretrained models BERT-large and ResNet152 serve as teacher models, and their knowledge is transferred to MIBERT via MJ-KD. Experimental results show that MIBERT matches or exceeds the benchmark models on multiple tasks in both image and text single-modality scenarios.
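The abstract does not specify the exact form of MJ-KD. As a point of reference only, the sketch below shows a generic multi-teacher knowledge distillation objective in PyTorch: cross-entropy on hard labels plus the average KL divergence to each teacher's softened output. The function name, the temperature, and the weighting coefficient are illustrative assumptions, not the paper's reported settings; in MIBERT's setting, only the teacher matching the input modality (BERT-large for text, ResNet152 for images) would typically contribute on a given example.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """Hypothetical multi-teacher KD objective (not the paper's MJ-KD):
    supervised cross-entropy plus the mean KL divergence between the
    student's and each teacher's temperature-softened distributions."""
    # Supervised loss on ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Distillation loss, averaged over the active teachers.
    kd = 0.0
    for t_logits in teacher_logits_list:
        kd += F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2  # standard T^2 scaling (Hinton et al., 2015)
    kd /= len(teacher_logits_list)

    # alpha balances hard-label supervision against distillation.
    return alpha * ce + (1.0 - alpha) * kd
```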