Because the expressive depth of an emotional face differs across individuals, expressions, and situations, recognizing an expression from a single facial image captured at one moment is difficult. One approach to alleviating this difficulty is a video-based method that uses multiple frames to extract temporal information between facial expression images. In this paper, we attempt to use a generated image that is estimated from the given single image. We then propose a contrastive representation that captures the expression difference for discriminative purposes. The contrastive representation is computed at the embedding layer of a deep network by comparing the given image with a reference sample generated by a deep encoder-decoder network. Accordingly, we deploy deep neural networks that combine a generative model, a contrastive model, and a discriminative model. In the proposed networks, we attempt to disentangle the facial expressive factor in two steps: learning a reference generator network and learning a contrastive encoder network. We conducted extensive experiments on three publicly available facial expression databases (CK+, MMI, and Oulu-CASIA) that have been widely adopted in the recent literature. The proposed method outperforms known state-of-the-art methods in terms of recognition accuracy.
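
The data flow described above (generate a reference from the single input image, embed both, compare the embeddings, classify the comparison) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the untrained random linear layers stand in for the deep networks, element-wise subtraction stands in for the comparison at the embedding layer, and all names and sizes (`generate_reference`, `contrastive_representation`, `PIXELS`, `EMBED`, `CLASSES`) are hypothetical. It is not the architecture or training procedure of the proposed method.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Random, untrained weights (placeholder for a learned network)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def contrastive_representation(query_embedding, reference_embedding):
    """Assumed comparison operator: element-wise difference of embeddings."""
    return [q - r for q, r in zip(query_embedding, reference_embedding)]


rng = random.Random(0)
PIXELS, EMBED, CLASSES = 16, 4, 6  # hypothetical toy sizes

# Stand-in reference generator (encoder-decoder).
gen_enc_w, gen_enc_b = init_layer(rng, EMBED, PIXELS)
gen_dec_w, gen_dec_b = init_layer(rng, PIXELS, EMBED)
# Stand-in contrastive encoder and discriminative classifier.
con_w, con_b = init_layer(rng, EMBED, PIXELS)
cls_w, cls_b = init_layer(rng, CLASSES, EMBED)


def generate_reference(image):
    """Map the input image to a reference image of the same size."""
    code = linear(image, gen_enc_w, gen_enc_b)
    return linear(code, gen_dec_w, gen_dec_b)


def classify(image):
    """Embed image and reference, contrast them, return class probabilities."""
    reference = generate_reference(image)
    query_embedding = linear(image, con_w, con_b)
    reference_embedding = linear(reference, con_w, con_b)
    contrast = contrastive_representation(query_embedding, reference_embedding)
    return softmax(linear(contrast, cls_w, cls_b))


image = [rng.random() for _ in range(PIXELS)]  # toy flattened "face image"
probs = classify(image)
```

In this toy version the two-step learning would correspond to fitting the generator layers first and the contrastive encoder and classifier afterwards; the weights here are left random, so `probs` only demonstrates the shapes and the order of operations.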