In the context of multimedia content, a modality can be defined as a type of data item such as text, images, music, or videos. To date, only limited research has been conducted on the cross-modal retrieval of suitable music for a specified video, or vice versa. Moreover, much of the existing research relies on metadata such as keywords, tags, or associated descriptions that must be manually produced and attached after the fact. This paper introduces a new content-based, cross-modal retrieval method for video and music implemented with deep neural networks. The proposed model consists of a two-branch network that extracts features from the two modalities and embeds them into a single embedding space. We train the network with a cross-modal ranking loss so that videos and music with similar semantics end up close together in the embedding space. In addition, to preserve the inherent characteristics within each modality, we also train with a proposed single-modal structure loss. Owing to the lack of a dataset for evaluating cross-modal video-music tasks, we constructed a large-scale video-music pair benchmark. Finally, we introduce reasonable quantitative and qualitative experimental protocols. The experimental results on our dataset are expected to serve as a baseline for subsequent studies of the less-mature video-to-music and music-to-video retrieval tasks.
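The abstract mentions training with a cross-modal ranking loss that pulls semantically matching video-music pairs together while pushing mismatched pairs apart. As a minimal illustrative sketch (not the paper's actual implementation), the hinge-style triplet form of such a loss can be written as follows; the function names, embedding dimensionality, and margin value are assumptions for illustration only:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_modal_ranking_loss(video_emb, music_pos, music_neg, margin=0.2):
    # Hinge-style ranking loss: the matching music clip should score
    # at least `margin` higher than a non-matching one.
    return max(0.0, margin - cosine(video_emb, music_pos)
                           + cosine(video_emb, music_neg))

# Toy 4-d embeddings (purely illustrative).
v       = np.array([1.0, 0.0, 0.0, 0.0])  # video embedding
m_match = np.array([0.9, 0.1, 0.0, 0.0])  # semantically similar music
m_other = np.array([0.0, 0.0, 1.0, 0.0])  # unrelated music

loss = cross_modal_ranking_loss(v, m_match, m_other)
```

With the matched pair already well separated from the mismatched one, the hinge is inactive and the loss is zero; swapping the positive and negative clips yields a positive loss that would drive gradient updates in a real two-branch network.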