arxivst stuff from arxiv that you should probably bookmark

Fonduer: Knowledge Base Construction from Richly Formatted Data

Abstract · Mar 15, 2017 09:12

cs-db

Arxiv Abstract

  • Sen Wu
  • Luke Hsiao
  • Xiao Cheng
  • Braden Hancock
  • Theodoros Rekatsinas
  • Philip Levis
  • Christopher Ré

We introduce Fonduer, a knowledge base construction (KBC) framework for richly formatted information extraction (RFIE), where entity relations and attributes are conveyed via structural, tabular, visual, and textual expressions. Fonduer introduces a new programming model for KBC built around a unified data representation that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer is the first KBC system for richly formatted data and uses a human-in-the-loop paradigm for training machine learning systems, referred to as data programming. Data programming softens the burden of traditional supervision by only asking users to provide lightweight functions that programmatically assign (potentially noisy) labels to the input data. Fonduer’s unified data model, together with data programming, allows users to use domain expertise as weak signals of supervision that help guide the KBC process over richly formatted data. We evaluate Fonduer on four real-world applications over different domains and achieve an average improvement of 42 F1 points over the upper bound of state-of-the-art approaches. In some domains, our users have produced up to 1.87x the number of correct entries compared to expert-curated public knowledge bases. Fonduer scales gracefully to millions of documents and is used in both academia and industry to create knowledge bases for real-world problems in many domains.
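
As a rough illustration of the data programming idea described in the abstract, the sketch below shows what "lightweight functions that programmatically assign (potentially noisy) labels" might look like for a candidate relation drawn from a richly formatted document. It is not Fonduer's API: the candidate fields (`same_row`, `column_header`, `part_page`, `value_page`), the function names, and the label constants are all hypothetical. The simple majority vote is a stand-in chosen for brevity; data programming as described in the literature learns the accuracies of the labeling functions rather than counting votes.

```python
from typing import Callable, Dict, List

# Hypothetical label values: a labeling function may vote or abstain.
POSITIVE, ABSTAIN, NEGATIVE = 1, 0, -1

# Hypothetical candidate: a (part, value) pair plus structural, tabular,
# and textual features taken from the document it was extracted from.
Candidate = Dict[str, object]
LabelingFunction = Callable[[Candidate], int]


def lf_same_table_row(c: Candidate) -> int:
    """Tabular signal: mentions in the same table row are likely related."""
    return POSITIVE if c["same_row"] else ABSTAIN


def lf_header_mentions_current(c: Candidate) -> int:
    """Textual signal: the column header names the attribute we want."""
    return POSITIVE if "current" in str(c["column_header"]).lower() else ABSTAIN


def lf_far_apart_pages(c: Candidate) -> int:
    """Structural signal: mentions more than one page apart are unlikely to be related."""
    distance = abs(int(c["part_page"]) - int(c["value_page"]))
    return NEGATIVE if distance > 1 else ABSTAIN


LABELING_FUNCTIONS: List[LabelingFunction] = [
    lf_same_table_row,
    lf_header_mentions_current,
    lf_far_apart_pages,
]


def majority_vote(c: Candidate, lfs: List[LabelingFunction]) -> int:
    """Combine noisy votes by sign of their sum (simplified stand-in for a learned model)."""
    total = sum(lf(c) for lf in lfs)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


# Three made-up candidates, used only to exercise the functions above.
agreeing = {"same_row": True, "column_header": "Collector Current",
            "part_page": 1, "value_page": 1}
distant = {"same_row": False, "column_header": "Voltage",
           "part_page": 1, "value_page": 5}
conflicting = {"same_row": True, "column_header": "Voltage",
               "part_page": 1, "value_page": 4}

labels = [majority_vote(c, LABELING_FUNCTIONS)
          for c in (agreeing, distant, conflicting)]
```

Each function encodes one piece of domain expertise as a weak signal, and any of them can be wrong on a given candidate. In the last candidate the tabular and structural signals disagree, so the vote abstains rather than commit to a label.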
