Abstract:
To address the low recognition accuracy of two-stream networks on video frames that contain redundant information, scSE (Spatial and Channel Squeeze & Excitation Block) and non-local operations are introduced into the two-stream network to construct the SC_NLResNet behavior recognition framework. The framework divides the video into equal, non-overlapping temporal segments and sparsely samples each segment, extracting RGB frames and optical flow images as the input of the scSE module. The features processed by scSE are fed into the non-local two-stream ResNet network, and the segment-level predictions are merged to obtain the final prediction. The experimental accuracy on the UCF101 and HMDB51 datasets reaches 96.9% and 76.2%, respectively. The results show that combining non-local operations with the scSE module strengthens spatiotemporal and inter-channel feature information and thereby improves accuracy, which verifies the effectiveness of the SC_NLResNet network.
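
The segment-based sampling and merging steps described above can be illustrated with a minimal sketch. The function names `sample_segment_indices` and `merge_segment_scores` are hypothetical. Drawing one random frame per segment and merging by simple averaging are assumptions made for illustration only; the abstract does not specify either choice.

```python
import random


def sample_segment_indices(num_frames, num_segments, rng=None):
    """Split [0, num_frames) into equal, non-overlapping segments and
    draw one frame index from each segment (sparse sampling).

    Assumption: one random frame per segment; frames left over when
    num_frames is not divisible by num_segments are ignored.
    """
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    seg_len = num_frames // num_segments
    return [k * seg_len + rng.randrange(seg_len) for k in range(num_segments)]


def merge_segment_scores(segment_scores):
    """Merge per-segment class scores into one video-level prediction.

    Assumption: plain averaging across segments; the merging function
    actually used by the framework is not stated in the abstract.
    """
    num_segments = len(segment_scores)
    num_classes = len(segment_scores[0])
    return [
        sum(scores[c] for scores in segment_scores) / num_segments
        for c in range(num_classes)
    ]


# Usage: a 90-frame video split into 3 segments of 30 frames each.
indices = sample_segment_indices(90, 3, random.Random(0))
# Stand-in per-segment scores for 2 classes (made-up numbers, not results).
video_scores = merge_segment_scores([[1.0, 3.0], [3.0, 5.0], [2.0, 4.0]])
```

In a full pipeline, each sampled index would select an RGB frame and its associated optical flow input, and the per-segment scores would come from the two network streams rather than from fixed numbers.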