In this paper, we study the task of 3D human pose estimation in the wild. This task is challenging because existing benchmark datasets provide either 2D annotations in the wild or 3D annotations in controlled environments. We propose a weakly-supervised transfer learning method that learns an end-to-end network from training data with mixed 2D and 3D labels. The network augments a state-of-the-art 2D pose estimation network with a 3D depth regression network. Unlike previous approaches that train these two sub-networks sequentially, we introduce a unified training method that fully exploits the correlation between the two sub-tasks and learns common feature representations. In doing so, the 3D pose labels from controlled environments are transferred to in-the-wild images that possess only 2D annotations. In addition, we introduce a 3D geometric constraint to regularize the predicted 3D poses, which is effective for images that have only 2D annotations. Our method leads to considerable performance gains and achieves competitive results on both 2D and 3D benchmarks. It produces high-quality 3D human poses in the wild, without any supervision from in-the-wild 3D data.
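The mixed supervision described above can be sketched as a single training loss that switches by annotation type: samples with 3D labels incur a depth regression loss, while 2D-only in-the-wild samples are regularized by a geometric prior on bone proportions. The sketch below is illustrative only and is not the authors' exact formulation; the skeleton edges, canonical length ratios, and the loss weight `lam` are hypothetical placeholders.

```python
import numpy as np

# Hypothetical skeleton edges and canonical bone-length ratios
# (placeholders, not the paper's actual skeleton definition).
BONES = [(0, 1), (1, 2), (2, 3)]
REF_RATIOS = np.array([1.0, 0.8, 0.6])

def depth_loss(pred_depth, gt_depth):
    """L2 regression loss on joint depths, for fully 3D-annotated samples."""
    return float(np.mean((pred_depth - gt_depth) ** 2))

def bone_lengths(joints_3d):
    """Euclidean length of each bone, given a (J, 3) array of joint positions."""
    return np.array([np.linalg.norm(joints_3d[a] - joints_3d[b])
                     for a, b in BONES])

def geometric_loss(joints_3d):
    """Penalize deviation of predicted bone-length ratios from canonical
    human proportions; applicable to samples with only 2D annotations."""
    lengths = bone_lengths(joints_3d)
    ratios = lengths / (lengths.sum() + 1e-8)   # scale-invariant proportions
    ref = REF_RATIOS / REF_RATIOS.sum()
    return float(np.mean((ratios - ref) ** 2))

def mixed_loss(joints_3d, gt_depth=None, lam=0.1):
    """Supervised depth regression when 3D labels exist; otherwise fall back
    to the weighted geometric constraint on the predicted 3D pose."""
    if gt_depth is not None:
        return depth_loss(joints_3d[:, 2], gt_depth)
    return lam * geometric_loss(joints_3d)
```

In an end-to-end setup, both terms would be accumulated over a batch that mixes 3D-labeled and 2D-only samples, so gradients from the geometric prior flow through the shared feature representation alongside the regression loss.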