A GRAPH ATTENTION NETWORK FOR VISUAL COMMON SENSE REASONING
Abstract
Visual common sense reasoning (VCR) is a challenging multimodal task proposed in recent years. To reason about the semantic relationships in images and improve performance on the VCR task, a graph attention network for visual common sense reasoning is proposed. The method encodes the visual objects in each image as visual nodes and uses a graph attention network to model the features of each visual node together with those of its adjacent nodes, thereby capturing the internal associations between objects. In addition, the method effectively captures the dynamic interactions between visual objects, which further improves the understanding of image semantics. Experiments on the VCR dataset show that the method improves performance on all three VCR sub-tasks.
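The abstract does not spell out the attention computation, so the following is only a minimal sketch of a generic single-head graph attention layer over visual nodes, not the authors' implementation. It assumes the standard formulation in which each node attends to its neighbours through a learned projection and a LeakyReLU-scored softmax. The function name `graph_attention_layer`, the feature sizes, the random weights, and the toy chain-shaped adjacency matrix are all hypothetical and chosen purely for illustration.

```python
import numpy as np

def graph_attention_layer(h, adj, W, a, slope=0.2):
    """Generic single-head graph attention (illustrative sketch).

    h:   (N, F)  features of N visual nodes (e.g. detected objects)
    adj: (N, N)  0/1 adjacency matrix, assumed to include self-loops
    W:   (F, F2) shared linear projection
    a:   (2*F2,) attention vector applied to concatenated node pairs
    """
    z = h @ W                                  # project node features: (N, F2)
    f_out = z.shape[1]
    # a^T [z_i || z_j] splits into a source term and a target term
    src = z @ a[:f_out]                        # (N,)
    dst = z @ a[f_out:]                        # (N,)
    e = src[:, None] + dst[None, :]            # raw pairwise scores: (N, N)
    e = np.where(e > 0, e, slope * e)          # LeakyReLU
    e = np.where(adj > 0, e, -np.inf)          # keep only adjacent nodes
    e = e - e.max(axis=1, keepdims=True)       # stabilise the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha                    # updated features, attention weights

# Toy example: four object nodes connected in a chain (assumed graph).
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])
W = rng.normal(size=(8, 5))
a = rng.normal(size=(10,))
out, alpha = graph_attention_layer(h, adj, W, a)
```

Each row of `alpha` is a distribution over that node's neighbours, so `out` mixes a node's projected features with those of adjacent nodes only. This is the sense in which a graph attention network can model associations between objects.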