GANs (generative adversarial networks) are useful because they can generate images of flowers or faces, and this latest refinement can generate images of specific flowers or faces (e.g. a daisy instead of just a “pink blobby thing with petals”). To do that, the authors combine a GAN with a VAE (variational autoencoder) and add some interesting techniques (an asymmetric loss and a mapping between latent and real space) to make it all work.
Google’s new text-to-speech model, Tacotron, shows a lot of promise. We spent last night listening to the audio samples it generated. In particular, the examples where the model inflects for punctuation were really impressive. Tacotron’s architecture combines seq2seq methods with an attention mechanism. It’s trained from scratch on text-audio pairs without any specific feature engineering. Unfortunately, the authors use an internal dataset, so you can’t go out and replicate their work today.
March 29 2017
In a narrow domain, such as customer service, there are common recurring questions and answers. To take advantage of this repetition, you can train a neural network to find appropriate question-answer pairs. Amazon presents a new method for training a simple dual encoder to identify question-answer pairs based on the embeddings of past responses. The trained encoder then selects the appropriate response template. In their experiments, the selected templates covered >70% of past customer inquiries.
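A minimal numpy sketch of the selection step described above, assuming the dual encoder has already produced embeddings; the toy vectors and template names are illustrative, not Amazon's:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_template(question_vec, template_vecs):
    """Pick the response template whose embedding is closest to the
    question embedding produced by the question-side encoder."""
    scores = [cosine(question_vec, t) for t in template_vecs]
    return int(np.argmax(scores))

# Toy embeddings standing in for the two trained encoders' outputs.
q = np.array([0.9, 0.1, 0.0])                 # incoming customer question
templates = [np.array([0.0, 1.0, 0.0]),       # "reset your password" template
             np.array([1.0, 0.2, 0.0]),       # "where is my order" template
             np.array([0.0, 0.0, 1.0])]       # "refund policy" template
best = select_template(q, templates)          # index of the closest template
```

In production the two encoders would share training so that questions land near their historical answers in the same embedding space; the cosine ranking itself stays this simple.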
Counting objects in an image is a difficult task due to changes in both density and scale. To address this task, the authors use an ensembling method known as stacking to create a mixture of experts (MoE). Individual CNNs are trained on each sub-problem and the results are “gated” by a final CNN which selects the best expert for the current sample.
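A toy sketch of that gating step, with plain functions standing in for the trained CNNs (the density-based gate and the two counters are invented for illustration):

```python
import numpy as np

def moe_predict(x, experts, gating):
    """Stacked mixture of experts: the gating model scores each expert for
    the current sample, and the best expert's prediction is used."""
    scores = gating(x)                       # one score per expert
    best = int(np.argmax(scores))
    return experts[best](x), best

# Toy stand-ins: each "expert" handles a different object density.
experts = [lambda x: x.sum() * 1.0,          # sparse-scene counter
           lambda x: x.sum() * 0.8]          # dense-scene counter (discounts overlap)
gating = lambda x: np.array([1.0 - x.mean(), x.mean()])  # density-based gate

dense_scene = np.ones((4, 4))                # mean 1.0 -> dense expert should win
count, chosen = moe_predict(dense_scene, experts, gating)
```

The real gating network sees the image itself and learns which expert's training distribution the sample resembles; the argmax selection is the same.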
This is a solid survey of visual question answering (VQA) algorithms. It explores the creation of VQA datasets, gives an overview of the current state of the art, and provides a comprehensive analysis of the methods used for various sub-tasks. The authors begin with a simple multi-layer perceptron and build up to various state-of-the-art architectures. They also review the common pitfalls in the VQA problem domain.
March 28 2017
Continuing the trend of more flexible convolutions, as seen in last week’s deformable convolution, the authors of this paper propose active convolution units (ACUs). During training, an ACU learns the shape of the convolution and can represent any fixed convolution. This technique shows promise because it lets you skip hand-tuning some hyperparameters.
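A numpy sketch of the core idea, under the assumption (from the paper's description) that each kernel tap samples the input at a learnable fractional offset via bilinear interpolation; the helper names are ours, not the paper's:

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Sample img at fractional coordinates (y, x) with bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, img.shape[0] - 1), min(x0 + 1, img.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

def active_conv_at(img, weights, offsets, cy, cx):
    """One output of an 'active' convolution: each tap samples the input at a
    learned fractional offset from the center instead of a fixed grid point."""
    return sum(w * bilinear_sample(img, cy + dy, cx + dx)
               for w, (dy, dx) in zip(weights, offsets))

rng = np.random.default_rng(0)
img = rng.normal(size=(5, 5))
weights = rng.normal(size=9)
# With offsets fixed to the regular 3x3 grid, the unit reduces to a standard conv,
# which is why an ACU can represent any fixed convolution.
grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
out = active_conv_at(img, weights, grid, 2, 2)
```

Training would backpropagate through both the weights and the offsets, which is what lets the network discover the kernel shape instead of you choosing it.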
Comprehensive review of deep-learning techniques. This paper starts with a brief history of deep learning and then jumps into comparisons of modern deep learning architectures and frameworks. It then explores hardware and architectural choices for efficient deep learning. A good paper to bookmark if you’re not already up to speed with deep neural networks.
When creating a network to caption an image, feeding in the image features is an important step known as ‘binding’. Early binding results in a network that mixes image and language information, whereas late binding maintains a separation between the linguistic string and the image vector until just before prediction. This study investigates binding systematically and shows that adding image features as the last “word” of the caption prefix (late binding) results in the best outcome.
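A toy numpy sketch of late binding as described above, with a trivial recurrence standing in for the caption model; the projection matrix and dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # shared embedding size

def encode_prefix(word_vecs, image_vec, W_img):
    """Late binding: the caption prefix is processed as words only, and the
    image features are appended as a final pseudo-'word' just before the
    next-word prediction, keeping language and vision separate until the end."""
    img_token = W_img @ image_vec            # project image features into word space
    sequence = word_vecs + [img_token]       # image enters last
    h = np.zeros(d)
    for v in sequence:                       # toy recurrence in place of an RNN
        h = np.tanh(0.5 * h + 0.5 * v)
    return h                                 # state fed to the next-word classifier

words = [rng.normal(size=d) for _ in range(3)]   # embedded caption prefix
image = rng.normal(size=16)                      # CNN image features
W_img = rng.normal(size=(d, 16)) * 0.1
state = encode_prefix(words, image, W_img)
```

Early binding would instead mix `img_token` into every step (or into the initial state), which is exactly the entanglement of image and language information the study found to perform worse.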
Detecting faces using deep learning is easy; detecting faces at different scales with the same neural network is hard. Using a novel combination of deep learning and classical learning, the authors were able to detect facial regions at multiple scales. This multi-scale method sets a new state of the art on the WIDER FACE dataset.
March 27 2017
Hand-designing concrete actions for an RL agent may not be sufficient to complete some tasks. A new policy-gradient algorithm, Discovery of Deep Options (DDO), is effective at learning these abstract actions, which accelerate learning. It shows promising results, such as matching annotations on robot-assisted surgical videos and producing kinematics that match a human expert.
Using a new direct regression model, the authors achieve a new state of the art on the ICDAR2015 Incidental Scene Text benchmark. With an F1 measure of 81%, this model outperforms previous methods based on indirect regression, such as Faster R-CNN.
Traditional image classifiers tend to identify whole-image labels. This paper uses Adversarial Erasing, a novel technique that iteratively removes regions from an image, forcing the classifier to learn from more of the image. This technique achieved a new state of the art for segmentation from whole-image labels on the PASCAL VOC 2012 test set with a mIoU of 55.7%.
March 24 2017
Create quality-resilient deep neural networks by using a set of “expert” models. These experts are trained on different quality distortions, and their outputs are weighted by a separate gating network which matches input distortions to the kinds each expert was trained on.
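Unlike the counting paper above, which picks a single expert, here the experts' outputs are blended. A toy sketch of that soft weighting, with made-up class probabilities standing in for the distortion experts:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gated_ensemble(expert_outputs, gating_scores):
    """Weight each distortion expert's class probabilities by the gating
    network's estimate of how well that expert matches the input's distortion."""
    w = softmax(np.asarray(gating_scores))       # one weight per expert
    return sum(wi * out for wi, out in zip(w, expert_outputs))

# Toy: two experts (trained on blur vs. noise) output class probabilities.
blur_expert  = np.array([0.7, 0.3])
noise_expert = np.array([0.2, 0.8])
gate = np.array([2.0, 0.0])        # gating net thinks the input looks blurred
probs = gated_ensemble([blur_expert, noise_expert], gate)
```

Because the weights come from a softmax, the blended output is still a valid probability distribution, dominated by whichever expert the gate trusts.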
Combine GANs and ResNets to improve one-shot learning on small datasets. This deep network beats the best methods for one-shot and five-shot learning on the mini-ImageNet dataset.
Instead of treating all information in a video frame as equal, this experiment splits the image into salient (appearance and motion) and non-salient (background) streams and trains separate CNNs on each. This treatment gives state-of-the-art results on the UCF-101 and CCV datasets.