General Approach to Real World Text Extraction

Post · Apr 13, 2017 15:56

Breakthrough on the French Street Name Signs (FSNS) dataset. This paper reports a big jump over the previous state of the art (84.2% accuracy vs. 72.46%), and compares CNN feature extractors of different depths, including Inception V3 and Inception Resnet V2, along the way. And since that wasn’t hard enough, they also tested it on an even harder dataset derived from Google Street View, ’cause it’s all like nbd.

Highlights From the Paper

  • Novel attention mechanism allows us to extract structured text information by reading only the interesting parts of the whole image.
  • We achieve 84.2% accuracy on FSNS, significantly outperforming the previous state-of-the-art [10], which achieved 72.46%.
  • 48% of the “errors” are actually due to the incorrect ground truth.
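
To get a feel for what the last two numbers mean together, here is a back-of-the-envelope sketch. It assumes the 48% share applies to all FSNS errors, which may not hold if that figure was measured on a sample. The function name `adjusted_accuracy` is made up for illustration and does not come from the paper.

```python
def adjusted_accuracy(accuracy, bad_label_share):
    """Accuracy if errors caused by wrong ground truth were counted as correct."""
    error_rate = 1.0 - accuracy
    return accuracy + error_rate * bad_label_share


reported = 0.842        # accuracy reported on FSNS
bad_label_share = 0.48  # share of "errors" blamed on incorrect ground truth (assumed to generalize)

print(f"error rate: {1 - reported:.1%}")
print(f"adjusted accuracy: {adjusted_accuracy(reported, bad_label_share):.1%}")
```

Under that assumption, the 15.8% error rate splits into roughly 7.6 points of labeling mistakes and 8.2 points of real model mistakes, so the adjusted accuracy lands near 91.8%. Treat it as a rough estimate, not a result from the paper.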



arXiv Abstract

  • Zbigniew Wojna
  • Alex Gorban
  • Dar-Shyang Lee
  • Kevin Murphy
  • Qian Yu
  • Yeqing Li
  • Julian Ibarz

We present a neural network model - based on CNNs, RNNs and a novel attention mechanism - which achieves 84.2% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state of the art (Smith'16), which achieved 72.46%. Furthermore, our new method is much simpler and more general than the previous approach. To demonstrate the generality of our model, we show that it also performs well on an even more challenging dataset derived from Google Street View, in which the goal is to extract business names from store fronts. Finally, we study the speed/accuracy tradeoff that results from using CNN feature extractors of different depths. Surprisingly, we find that deeper is not always better (in terms of accuracy, as well as speed). Our resulting model is simple, accurate and fast, allowing it to be used at scale on a variety of challenging real-world text extraction problems.
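
The abstract doesn't spell out how the attention mechanism works, so here is a minimal sketch of generic soft (dot-product) attention over a flattened CNN feature map, just to show the idea of "reading only the interesting parts" of an image. This is not the paper's actual mechanism. The names `attend`, `features` and `query` are hypothetical, and the toy numbers are invented.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """Weight each image location by its dot-product score against the query.

    features: list of per-location feature vectors (a flattened CNN feature map)
    query: vector standing in for the RNN's current state
    Returns (attention weights, weighted-sum context vector).
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context


# Toy 4-location "feature map" with 2-dimensional features (invented values)
features = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [0.1, 0.1]]
query = [1.0, 1.0]

weights, context = attend(features, query)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

In a full model, the RNN would presumably recompute the weights at every output character, so each decoding step looks at a different part of the image. Here the second location gets most of the weight because it has the highest score against the query.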

