In this paper, we present a new task, Visual Relationship Forecasting (VRF) in videos, to explore predicting visual relationships through reasoning. We introduce two video datasets, VRF-AG and VRF-VidOR, and present a Graph Convolutional Transformer framework that captures both object-level and frame-level dependencies.