Owing to the success of deep learning techniques for tasks such as Q/A and text-based dialog, there is an increasing demand for AI agents in domains such as retail, travel, and entertainment that can carry on multimodal conversations with humans, seamlessly employing both text and images within a dialog. However, deep learning research in this area has been limited, primarily due to the lack of large-scale, open conversation datasets. To overcome this bottleneck, in this paper we introduce the task of multimodal, domain-aware conversations and propose the MMD benchmark dataset towards this task. This dataset was gathered by working in close coordination with a large number of domain experts in the retail domain and consists of over 150K conversation sessions between shoppers and sales agents. With this dataset, we propose five new sub-tasks for multimodal conversations along with their evaluation methodology. We also propose two novel multimodal deep learning models in the encode-attend-decode paradigm and demonstrate their performance on two of the sub-tasks, namely text response generation and best image response selection. These experiments serve to establish baseline performance numbers and open new directions of research for each of these sub-tasks.