New SOTA for VQA 1.0
Post · Apr 12, 2017 03:14
There’s a lot of research around VQA right now. This new paper sets a new state of the art on VQA 1.0, an improvement of 0.4%. The margin is small, but the architecture is also simpler than recent approaches.
Highlights From the Paper
- We treat the visual question answering task as a classification problem. Given an image (i) and a question (q) in natural language, we want to estimate the most likely answer (a) from a fixed set of answers based on the content of the image.
- For every question (in VQA 2.0), there are two images in the dataset that result in two different answers to the question.
- A relatively simple architecture (compared to recent works), when trained carefully, beats the state of the art.
- This paper proves once again that when it comes to training neural networks, the devil is in the details.
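The classification framing in the first highlight can be sketched in a few lines: encode the image and the question, fuse the two representations, and take a softmax over a fixed answer vocabulary. This is a minimal illustration, not the paper's actual model; all dimensions and weight names here are made up, and random matrices stand in for learned parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical dimensions, chosen for illustration only.
D_IMG, D_Q, D_HID, N_ANSWERS = 2048, 300, 512, 3000

rng = np.random.default_rng(0)
# Random stand-ins for learned projection weights.
W_img = rng.standard_normal((D_HID, D_IMG)) * 0.01
W_q = rng.standard_normal((D_HID, D_Q)) * 0.01
W_out = rng.standard_normal((N_ANSWERS, D_HID)) * 0.01

def answer_distribution(img_feat, q_feat):
    """Fuse image and question features, then classify over the answer set."""
    h = np.tanh(W_img @ img_feat + W_q @ q_feat)  # joint representation
    return softmax(W_out @ h)                     # p(a | i, q)

img_feat = rng.standard_normal(D_IMG)  # stand-in for CNN image features
q_feat = rng.standard_normal(D_Q)      # stand-in for an encoded question

p = answer_distribution(img_feat, q_feat)
best = int(np.argmax(p))  # index of the most likely answer
```

The point of the framing is that "answering" reduces to picking the argmax of a probability distribution over a closed answer set, which is why the authors can train the whole thing as an ordinary classifier.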
This paper presents a new baseline for the visual question answering task. Given an image and a question in natural language, our model produces accurate answers according to the content of the image. Our model, while being architecturally simple and relatively small in terms of trainable parameters, sets a new state of the art on both the unbalanced and balanced VQA benchmarks. On the VQA 1.0 open-ended challenge, our model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over the state of the art, and on the newly released VQA 2.0, our model scores 59.7% on the validation set, outperforming the best previously reported results by 4%. The results presented in this paper are especially interesting because very similar models have been tried before but significantly lower performance was reported. In light of the new results, we hope to see more meaningful research on visual question answering in the future.
Read the paper (pdf) »