Abstract:
To address the difficulty that time-series-based speech emotion recognition methods have in quantifying the emotional information carried by individual frames, a speech emotion recognition model (LAM-CTC) combining a local attention mechanism (LAM) with connectionist temporal classification (CTC) is proposed. VGFCC emotional features are extracted as the input to a shared encoder. The CTC layer minimizes the CTC loss and predicts the emotion category, while the LAM layer applies the local attention mechanism to compute context vectors, which the decoder then decodes. The decoding results of the two branches are fused by averaging to obtain the final emotion prediction. Experimental results show that the unweighted average recall (UAR) and weighted average recall (WAR) of the proposed model on the IEMOCAP dataset reach 68.1% and 68.3%, respectively.
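The abstract does not specify the exact form of the local attention used to compute the context vector. The following is a minimal sketch, assuming a Luong-style windowed local attention with a Gaussian position penalty over the encoder states; the function name `local_attention_context`, the window radius `D`, and the scoring choice are illustrative assumptions, not the paper's confirmed formulation.

```python
# Hypothetical sketch of a local-attention context-vector computation.
# Assumption: attention is restricted to a window of radius D around a
# center position p_t (Luong-style local attention); the paper's actual
# LAM may differ in scoring and window placement.
import torch
import torch.nn.functional as F

def local_attention_context(enc_states, dec_state, p_t, D=8):
    """enc_states: (T, H) encoder outputs; dec_state: (H,) decoder state;
    p_t: scalar center position in [0, T); returns a context vector (H,)."""
    T, H = enc_states.shape
    lo = max(0, int(p_t) - D)
    hi = min(T, int(p_t) + D + 1)
    window = enc_states[lo:hi]                    # (W, H) local window
    scores = window @ dec_state                   # dot-product scores, (W,)
    # Gaussian penalty favoring frames near the predicted center p_t
    pos = torch.arange(lo, hi, dtype=torch.float32)
    scores = scores - (pos - p_t) ** 2 / (2 * (D / 2) ** 2)
    alpha = F.softmax(scores, dim=0)              # local attention weights
    return alpha @ window                         # weighted sum -> (H,)
```

In such a design, restricting attention to a local window concentrates the weights on a few nearby frames, which matches the stated goal of weighting frames by how much emotional information they carry rather than attending uniformly over the whole utterance.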