arxivst stuff from arxiv that you should probably bookmark

Realistic Fake Data: How to Generate Images of Specific Objects

GANs (generative adversarial networks) are useful because they can generate images of flowers or faces, and this latest refinement can generate images of specific flowers or faces (e.g. a daisy instead of just a “pink blobby thing with petals”). To do that, the authors combine a GAN with a VAE (variational autoencoder) and add some interesting techniques (an asymmetric loss and a mapping between latent and real space) to get it all to work.

vae gan

Text-to-Speech from Google

Google’s new text-to-speech model, Tacotron, shows a lot of promise. We spent last night listening to the audio samples generated by it. In particular, the examples where the model inflected its speech based on punctuation were really impressive. Tacotron’s architecture combines seq2seq methods with an attention mechanism. It’s trained from scratch on text-audio pairs without any specific feature engineering. Unfortunately, they use an internal dataset, so you can’t go out and replicate their work today.

text-to-speech cant-reproduce

March 29 2017

Amazon’s New Dual-Encoder for Question-Answer Pair Selection

In a narrow domain, such as customer service, there are common recurring questions and answers. To take advantage of this repetition, you can train a NN to find appropriate question-answer pairs. Amazon presents a new method of training a simple dual-encoder to identify question-answer pairs based on the embeddings of past responses. The trained encoder then selects the appropriate response template. In their experiments, they found that the selected templates cover >70% of past customer inquiries.
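
To make the selection step concrete, here is a minimal sketch of how a dual-encoder picks a response template: encode the question and each template separately, then take the template with the highest similarity. The bag-of-words `encode` function, the cosine score, and the example templates are all assumptions standing in for the learned encoders and real data; they are not taken from the paper.

```python
import math
from collections import Counter

def encode(text):
    """Toy stand-in for a learned encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_template(question, templates):
    """Score every template against the question and return the best match."""
    q = encode(question)
    return max(templates, key=lambda t: cosine(q, encode(t)))

# Hypothetical response templates, for illustration only.
TEMPLATES = [
    "your order has shipped and should arrive soon",
    "you can reset your password from the account page",
]

best = select_template("how do i reset my password", TEMPLATES)
print(best)
```

In a trained system the two sides would be neural encoders whose embeddings are learned from past question-answer pairs; the ranking-by-similarity step stays the same.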

amazon qa siamese

Count Objects in an Image with a Mixture of Experts

Counting objects in an image is a difficult task due to changes in both density and scale. To address this task, the authors use an ensembling method known as stacking to create a mixture of experts (MoE). Individual CNNs are trained on each sub-problem and the results are “gated” with a final CNN, which selects the best expert for the current sample.
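
A rough sketch of the gating idea, assuming just two experts and a hand-written gate: the gate scores each expert for the current sample and the highest-scoring expert produces the count. The expert functions, the density-based gate, and the dictionary fields are hypothetical stand-ins for the CNNs described in the paper.

```python
def sparse_expert(image):
    """Hypothetical expert for sparse scenes: counts isolated blobs."""
    return float(image["blobs"])

def dense_expert(image):
    """Hypothetical expert for crowded scenes: covered area / typical object area."""
    return image["covered_area"] / image["object_area"]

EXPERTS = [sparse_expert, dense_expert]

def gate(image):
    """Stand-in for the gating CNN: returns one score per expert."""
    density = image["covered_area"] / image["total_area"]
    return [1.0 - density, density]

def count_objects(image):
    """Pick the expert with the highest gate score and use its estimate."""
    scores = gate(image)
    chosen = max(range(len(EXPERTS)), key=lambda i: scores[i])
    return EXPERTS[chosen](image)

crowd = {"blobs": 3, "covered_area": 80.0, "object_area": 2.0, "total_area": 100.0}
sparse = {"blobs": 4, "covered_area": 8.0, "object_area": 2.0, "total_area": 100.0}

print(count_objects(crowd))   # routed to dense_expert
print(count_objects(sparse))  # routed to sparse_expert
```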

cnn counting moe stacking

Visual Question Answering, an In-Depth Survey

This is a solid survey of visual question answering (VQA) algorithms. It explores the creation of VQA datasets, gives an overview of the current state-of-the-art, and provides a comprehensive analysis of the methods used for various sub-tasks. The authors begin with a simple multi-layer perceptron and build up to various state-of-the-art architectures. They also review the common pitfalls in the VQA problem domain.

vqa survey

March 28 2017

An Alternative to Fixed Convolution Architectures

Continuing the trend of more flexible convolutions, as seen in last week’s deformable convolution, the authors of this paper propose active convolution units (ACUs). During training an ACU learns the shape of the convolution, and can represent any fixed convolution. This technique shows promise because it allows you to skip hand-tuning some hyperparameters.


DNN Efficiency: Past, Present, and Future

A comprehensive review of deep-learning techniques. This paper starts with a brief history of deep learning and then jumps into comparisons of modern deep learning architectures and frameworks. It then explores hardware and architectural choices for efficient deep learning. A good paper to bookmark to get you up to speed with deep neural networks, if you’re not already.

survey tutorial efficiency deep learning

Image Captioning Architectures—Best Practices

When creating a network to caption an image, feeding in the image features is an important step known as ‘binding’. Early binding results in a network that mixes image and language information, whereas late binding maintains a separation between the linguistic string and the image vector until just before prediction. This study investigates binding systematically and shows that adding image features as the last “word” of the caption prefix (late binding) results in the best outcome.
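
As a loose illustration of the two placements, the sketch below builds the input sequence a caption model would see under each scheme. Treating early binding as "image first" is an assumption made for contrast; the summary above only states that late binding puts the image features in as the last “word” of the prefix. The function names and the tuple encoding are hypothetical.

```python
def early_binding(image_vector, prefix_tokens):
    """Assumed early binding: the image enters first, so every later step mixes image and language."""
    return [("image", image_vector)] + [("word", t) for t in prefix_tokens]

def late_binding(image_vector, prefix_tokens):
    """Late binding: the image features are appended as the last 'word' of the caption prefix."""
    return [("word", t) for t in prefix_tokens] + [("image", image_vector)]

image_vector = [0.1, 0.7, 0.2]   # placeholder image features
prefix = ["a", "dog"]            # caption generated so far

print(early_binding(image_vector, prefix))
print(late_binding(image_vector, prefix))
```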

image captioning

New State of the Art for Face Detection with Significant Scale Differences

Detecting faces using deep learning is easy; detecting faces at different scales with the same neural network is hard. Using a novel combination of deep learning and classical learning, the authors were able to detect facial regions at multiple scales. This multi-scale method sets a new state-of-the-art on the WIDER FACE dataset.

r-cnn face detection state-of-the-art

March 27 2017

Discover Abstract Actions with Reinforcement Learning

Hand-designing concrete actions for an RL agent may not be sufficient to complete some tasks. A new policy-gradient algorithm, Discovery of Deep Options (DDO), learns abstract actions that accelerate learning. It shows promising results, such as finding structure in robot-assisted surgical videos and kinematics that matches a human expert’s annotations.

reinforcement learning

Identify Text in Real World Images using Deep Direct Regression

Using a new direct regression model, the authors achieve a new state-of-the-art on the ICDAR2015 Incidental Scene Text benchmark. With an F1-measure of 81%, this model outperforms previous methods based on indirect regression, such as Faster R-CNN.

state-of-the-art text icdar2015

Pixelwise Segmentation using Image Classifiers

Traditional image classifiers tend to identify whole-image labels. This paper uses Adversarial Erasing, a novel technique that iteratively removes regions from an image, forcing the classifier to learn from more of the image. This technique achieved state-of-the-art for segmentation from whole-image labels on the PASCAL VOC 2012 test set with an mIoU of 55.7%.
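
The erase-and-repeat loop can be sketched as follows. This is a simplified illustration, not the paper's procedure: a fixed grid of per-cell evidence scores stands in for whatever the classifier attends to at each round, and the 0.8 threshold ratio is an arbitrary assumption.

```python
def most_discriminative(evidence, ratio=0.8):
    """Cells whose score is within `ratio` of the current peak (assumed threshold)."""
    peak = max(max(row) for row in evidence)
    if peak <= 0:
        return set()
    return {(r, c)
            for r, row in enumerate(evidence)
            for c, v in enumerate(row)
            if v >= ratio * peak}

def adversarial_erase(evidence, steps=3):
    """Repeatedly find the strongest region, add it to the mask, and erase it."""
    evidence = [row[:] for row in evidence]  # work on a copy
    mask = set()
    for _ in range(steps):
        region = most_discriminative(evidence)
        if not region:
            break
        mask |= region
        for r, c in region:
            evidence[r][c] = 0.0  # erased cells can no longer be relied on
    return mask

# Toy evidence map: the object's most distinctive part is in the middle row.
evidence = [
    [0.0, 0.2, 0.0],
    [0.5, 1.0, 0.9],
    [0.0, 0.45, 0.0],
]

print(sorted(adversarial_erase(evidence, steps=1)))
print(sorted(adversarial_erase(evidence, steps=2)))
```

Each extra round grows the mask beyond the single most distinctive part, which is the intuition behind getting a fuller object region out of whole-image labels.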

state-of-the-art pascal voc 2012 segmentation

March 24 2017

Applying Ensemble Method to Augment Image Quality Resilience

Create quality-resilient deep neural networks by using a set of “expert” models. These experts are trained on different quality distortions, and their outputs are weighted by a separate gating network that matches the distortion in each input to the kind each expert was trained on.
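
A minimal sketch of the weighting step, assuming the gating network emits one raw score per expert and that scores are turned into weights with a softmax (the softmax and the example numbers are assumptions, not details from the paper):

```python
import math

def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine(expert_probs, gate_scores):
    """Weighted average of the experts' class probabilities."""
    weights = softmax(gate_scores)
    n_classes = len(expert_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, expert_probs))
            for c in range(n_classes)]

# Hypothetical class probabilities from two experts on one blurry input.
blur_expert = [0.9, 0.1]
noise_expert = [0.4, 0.6]

# The gate favors the blur expert because the input looks blurred.
mixed = combine([blur_expert, noise_expert], gate_scores=[2.0, 0.0])
print(mixed)
```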

ensemble deep learning image

One Shot Learning on Small Datasets with GANs and ResNets

Combine GANs and ResNets to improve one-shot learning on small datasets. This deep network beats out the best methods for one-shot and five-shot learning on the mini-Imagenet dataset.

deep learning gan one shot state-of-the-art

Utilize Saliency to Improve Video Classification

Instead of treating all information on a single video frame as equal, this experiment splits the image into salient (appearance and motion) and non-salient (background) streams and trains separate CNNs on those streams of information. This treatment gives state-of-the-art results on the UCF-101 and CCV datasets.

cnn video state-of-the-art
