ProcNets: Learning to Segment Procedures in Untrimmed and Unconstrained Videos

  • Luowei Zhou
  • Chenliang Xu
  • Jason J. Corso

We propose a temporal segmentation and procedure learning model for long, untrimmed, and unconstrained videos, e.g., videos from YouTube. The proposed model divides a video into segments that constitute a procedure and learns the underlying temporal dependencies among those segments. The output procedure segments can be applied to other tasks, such as video description generation or activity recognition. Two aspects distinguish our work from the existing literature. First, we introduce the problem of learning long-range temporal structure for procedure segments within a video, in contrast to the majority of efforts, which focus on understanding short-range temporal structure. Second, the proposed model segments an unseen video using only visual evidence and automatically determines the number of segments to predict. No large-scale dataset with annotated procedure steps is available for evaluation. Hence, we collect a new cooking video dataset, named YouCookII, with the procedure steps localized and described. Our ProcNets model achieves state-of-the-art performance in procedure segmentation.

