arxivst: stuff from arxiv that you should probably bookmark

Deep Learning for Content-Based, Cross-Modal Retrieval of Videos and Music

Abstract · Apr 22, 2017 07:40

modalities kaist content video modal music retrieval cs-cv

Arxiv Abstract

  • Sungeun Hong
  • Woobin Im
  • Hyun S. Yang

In the context of multimedia content, a modality can be defined as a type of data item, such as text, images, music, or videos. To date, only limited research has been conducted on cross-modal retrieval of suitable music for a specified video, or vice versa. Moreover, much of the existing research relies on metadata such as keywords, tags, or associated descriptions that must be individually produced and attached after the fact. This paper introduces a new content-based, cross-modal retrieval method for video and music, implemented with deep neural networks. The proposed model consists of a two-branch network that extracts features from the two modalities and embeds them into a single embedding space. We train the network with a cross-modal ranking loss so that videos and music with similar semantics end up close together in the embedding space. In addition, to preserve the inherent characteristics within each modality, we also train with a proposed single-modal structure loss. Owing to the lack of a dataset for evaluating cross-modal video-music tasks, we constructed a large-scale video-music pair benchmark. Finally, we introduce reasonable quantitative and qualitative experimental protocols. The experimental results on our dataset are expected to serve as a baseline for subsequent studies of the less-mature video-to-music and music-to-video retrieval tasks.
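The core training signal described above can be illustrated with a minimal sketch. Assuming video and music clips have already been embedded into a shared space as plain feature vectors, a cross-modal ranking loss penalizes cases where a video's matching music is not closer (here, by cosine similarity) than a non-matching clip by at least a margin. The margin value, the use of cosine similarity, and the function names are illustrative assumptions, not details from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (plain lists of floats).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(video_emb, pos_music_emb, neg_music_emb, margin=0.2):
    # Hinge-style cross-modal ranking loss: the matching (positive) music
    # should score higher than the non-matching (negative) music by `margin`;
    # the loss is zero once that gap is achieved. Margin 0.2 is an assumption.
    pos_sim = cosine(video_emb, pos_music_emb)
    neg_sim = cosine(video_emb, neg_music_emb)
    return max(0.0, margin - pos_sim + neg_sim)

if __name__ == "__main__":
    video = [1.0, 0.0]
    match = [1.0, 0.0]      # semantically similar music, already nearby
    mismatch = [0.0, 1.0]   # unrelated music, orthogonal in the space
    print(ranking_loss(video, match, mismatch))   # gap satisfied: loss 0.0
    print(ranking_loss(video, mismatch, match))   # violated: positive loss
```

In a full training setup this quantity would be computed over mini-batches of embedded pairs and minimized by gradient descent through both branches of the network; the sketch only shows the ranking objective itself.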

Read the paper (pdf) »