People detection in single 2D images has improved greatly in recent years. However, comparatively little of this progress has percolated into multi-camera multi-people tracking algorithms, whose performance still degrades severely when scenes become very crowded. In this work, we introduce a new architecture that combines Convolutional Neural Nets and Conditional Random Fields to explicitly model the ambiguities that arise in such crowded scenes. One of its key ingredients is a set of high-order CRF terms that model potential occlusions and give our approach its robustness even when many people are present. Our model is trained end-to-end, and we show that it outperforms several state-of-the-art algorithms on challenging scenes.