Abstract:
In recent years, deep learning for multimodal interaction has attracted extensive research attention, with multimodal pretraining models playing an indispensable role. However, experiments show that most of these large models perform poorly in single-modality scenarios, require training on large volumes of aligned multimodal corpora that are difficult to obtain, and have too many parameters to deploy easily. This paper therefore proposes MIBERT, a lightweight modality co-encoder that requires no aligned multimodal corpora and targets single-modality scenarios. To train MIBERT, a knowledge distillation method, MJ-KD, is designed: the pretrained models BERT-large and ResNet152 serve as teacher models, and their knowledge is transferred to MIBERT via MJ-KD. Experimental results show that MIBERT matches or exceeds the benchmark models on multiple tasks in both image and text single-modality scenarios.
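The abstract does not specify the exact form of MJ-KD. As a point of reference only, the sketch below shows a generic multi-teacher knowledge distillation objective in PyTorch: cross-entropy on hard labels plus the average KL divergence to each teacher's softened output. The function name, the temperature, and the weighting coefficient are illustrative assumptions, not the paper's reported settings; in MIBERT's setting, only the teacher matching the input modality (BERT-large for text, ResNet152 for images) would typically contribute on a given example.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """Hypothetical multi-teacher KD objective (not the paper's MJ-KD):
    supervised cross-entropy plus the mean KL divergence between the
    student's and each teacher's temperature-softened distributions."""
    # Supervised loss on ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Distillation loss, averaged over the active teachers.
    kd = 0.0
    for t_logits in teacher_logits_list:
        kd += F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2  # standard T^2 scaling (Hinton et al., 2015)
    kd /= len(teacher_logits_list)

    # alpha balances hard-label supervision against distillation.
    return alpha * ce + (1.0 - alpha) * kd
```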