Abstract:
To address the difficulty that time-series-based speech emotion recognition methods have in quantifying the emotional information carried by individual frames, a speech emotion recognition model (LAM-CTC) combining a local attention mechanism (LAM) with connectionist temporal classification (CTC) is proposed. VGFCC emotional features are extracted as the input to a shared encoder. The CTC layer minimizes the CTC loss and predicts the emotion category, while the LAM layer applies the local attention mechanism to compute context vectors, which the decoder then decodes. The decoding results of the two branches are fused by averaging to obtain the final emotion prediction. Experimental results show that the unweighted average recall (UAR) and weighted average recall (WAR) of the proposed model on the IEMOCAP dataset reach 68.1% and 68.3%, respectively.
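The abstract does not specify the exact form of the local attention used to compute the context vector. The following is a minimal sketch, assuming a Luong-style windowed local attention with a Gaussian position penalty over the encoder states; the function name `local_attention_context`, the window radius `D`, and the scoring choice are illustrative assumptions, not the paper's confirmed formulation.

```python
# Hypothetical sketch of a local-attention context-vector computation.
# Assumption: attention is restricted to a window of radius D around a
# center position p_t (Luong-style local attention); the paper's actual
# LAM may differ in scoring and window placement.
import torch
import torch.nn.functional as F

def local_attention_context(enc_states, dec_state, p_t, D=8):
    """enc_states: (T, H) encoder outputs; dec_state: (H,) decoder state;
    p_t: scalar center position in [0, T); returns a context vector (H,)."""
    T, H = enc_states.shape
    lo = max(0, int(p_t) - D)
    hi = min(T, int(p_t) + D + 1)
    window = enc_states[lo:hi]                    # (W, H) local window
    scores = window @ dec_state                   # dot-product scores, (W,)
    # Gaussian penalty favoring frames near the predicted center p_t
    pos = torch.arange(lo, hi, dtype=torch.float32)
    scores = scores - (pos - p_t) ** 2 / (2 * (D / 2) ** 2)
    alpha = F.softmax(scores, dim=0)              # local attention weights
    return alpha @ window                         # weighted sum -> (H,)
```

In such a design, restricting attention to a local window concentrates the weights on a few nearby frames, which matches the stated goal of weighting frames by how much emotional information they carry rather than attending uniformly over the whole utterance.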