BEHAVIOR RECOGNITION METHOD BASED ON R(2+1)D SPATIO-TEMPORAL FEATURE FUSION WITH ATTENTION

  • Abstract: In human behavior recognition with 3D convolution, the spatio-temporal information in consecutive video frames is insufficiently extracted and cross-channel interactions receive too little attention, which limits recognition accuracy. To address this, a behavior recognition method based on the R(2+1)D network with multi-branch spatio-temporal feature fusion and attention is proposed. Video frames are extracted and augmented; with the R(2+1)D network as the backbone and the Inception design incorporated, the input frames are convolved along multiple spatio-temporal branches and the branch features are fused; ECA channel attention then screens the fused features for cross-channel interaction information to extract more abstract high-level features; finally, classification outputs the human behavior recognition result. By fully exploiting the video's spatio-temporal features and cross-channel interactions, the method reaches 94.71% accuracy on the UCF101 dataset, 4.53 percentage points above the baseline R(2+1)D network, while the model parameters are reduced from 33.3×10⁶ to 26.9×10⁶. Experiments show that the method effectively improves the accuracy of human behavior recognition.
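The R(2+1)D backbone named in the abstract factorizes each full t×d×d 3D convolution into a 1×d×d spatial convolution followed by a t×1×1 temporal convolution, with the intermediate channel count M chosen so the block matches the parameter budget of the original 3D kernel (Tran et al., CVPR 2018). A minimal pure-Python sketch of that parameter accounting, as an illustration only and not this paper's exact layer configuration:

```python
from math import floor

def conv3d_params(n_in, n_out, t, d):
    """Parameters of a full t x d x d 3D convolution (biases omitted)."""
    return n_in * n_out * t * d * d

def r2plus1d_params(n_in, n_out, t, d):
    """Parameters after the R(2+1)D factorization: a 1 x d x d spatial
    convolution into M intermediate channels, then a t x 1 x 1 temporal
    convolution. M follows the formula in Tran et al. so the block's
    parameter count roughly matches the full 3D convolution."""
    m = floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))
    spatial = n_in * m * d * d    # (1 x d x d) spatial conv
    temporal = m * n_out * t      # (t x 1 x 1) temporal conv
    return m, spatial + temporal

# Example: a 3 x 3 x 3 kernel mapping 64 -> 64 channels.
full = conv3d_params(64, 64, 3, 3)            # 110592 parameters
m, factored = r2plus1d_params(64, 64, 3, 3)   # M = 144, 110592 parameters
```

The factorization keeps the parameter count while inserting an extra nonlinearity between the spatial and temporal convolutions, which is what lets the multi-branch fusion here stay within a smaller budget than the plain 3D baseline.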

     

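The ECA channel attention used to screen the fused features reduces to a global average pool over the spatio-temporal dimensions, a shared 1D convolution across channels with an adaptive kernel size k = ψ(C), and a sigmoid gate (Wang et al., CVPR 2020). A minimal NumPy sketch under stated assumptions: a uniform kernel stands in for the learned 1D weights, and zero padding is assumed:

```python
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size psi(C) from ECA-Net: the nearest odd
    value of |log2(C)/gamma + b/gamma|."""
    t = int(abs((np.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def eca_attention(x, weights=None):
    """Minimal sketch of ECA channel attention for a (C, T, H, W)
    feature map. `weights` is the shared 1D kernel; a uniform kernel
    is used here as a stand-in for learned parameters."""
    c = x.shape[0]
    k = eca_kernel_size(c)
    if weights is None:
        weights = np.full(k, 1.0 / k)      # stand-in for learned weights
    y = x.mean(axis=(1, 2, 3))             # global average pooling -> (C,)
    y = np.convolve(np.pad(y, k // 2), weights, mode='valid')  # shared 1D conv
    w = 1.0 / (1.0 + np.exp(-y))           # sigmoid gate per channel
    return x * w[:, None, None, None]      # rescale channels

rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 4, 7, 7))  # e.g. fused branch features
out = eca_attention(feat)                  # same shape, channel-reweighted
```

Unlike SE-style attention, ECA avoids channel dimensionality reduction, so the gate captures local cross-channel interaction at negligible parameter cost, which is consistent with the abstract's reduced parameter count.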

     
