DOI: 10.1145/1409360.1409378
Say you want to select a quiet, centrally located Manhattan hotel. Google returns an overwhelming seven million results in response to the query “new york city hotels.” Or, say you are trying to assemble a program committee for an annual conference, composed of researchers who have published at the conference in previous years, and to balance it geographically. While today’s Web search engines identify potentially relevant documents, you are forced to sift through a long list of URLs, scan each document to identify any pertinent bits of information, and assemble the extracted findings before you can solve your problem.
Over the coming decade, Web searching will increasingly transcend keyword queries in favor of systems that automate the tedious and error-prone task of sifting through documents. Moreover, we will build systems that fuse relevant pieces of information into a coherent overview, thus reducing from hours to minutes the time required to perform complex tasks.
Information extraction (IE), a venerable technology that maps natural-language text into structured relational data, offers a promising avenue toward this goal. Although extracting data from text is inherently challenging, given the ambiguous and idiosyncratic nature of natural language, substantial progress has been made over the last few decades.
This article surveys a range of IE methods, but we highlight Open Information Extraction [3, 4], wherein the identities of the relations to be extracted are unknown and the billions of documents found on the Web necessitate highly scalable processing.
At the core of an IE system is an extractor, which processes text; it overlooks irrelevant words and phrases and attempts to home in on entities and the relationships between them. For example, an extractor might map the sentence “Paris is the stylish capital of France” to the relational tuple (Paris, CapitalOf, France), which might be represented in RDF or another formal language.
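To make the sentence-to-tuple mapping concrete, here is a minimal sketch in Python. It is not the extractor of any system discussed in this article: the `Extraction` type, the `CAPITAL_OF` pattern, and the `extract_capital_of` function are hypothetical names invented for illustration, and a single hand-written pattern stands in for the much richer knowledge a real extractor needs.

```python
import re
from typing import NamedTuple, Optional


class Extraction(NamedTuple):
    """A relational tuple (arg1, relation, arg2). Hypothetical type for illustration."""
    arg1: str
    relation: str
    arg2: str


# Assumed hand-written rule: "<X> is the [modifiers] capital of <Y>",
# where X and Y are runs of capitalized words. Modifiers such as
# "stylish" are skipped, mirroring how an extractor overlooks
# irrelevant words.
CAPITAL_OF = re.compile(
    r"(?P<arg1>[A-Z]\w+(?: [A-Z]\w+)*) is the (?:\w+ )*?capital of "
    r"(?P<arg2>[A-Z]\w+(?: [A-Z]\w+)*)"
)


def extract_capital_of(sentence: str) -> Optional[Extraction]:
    """Return a CapitalOf tuple if the sentence matches the rule, else None."""
    match = CAPITAL_OF.search(sentence)
    if match is None:
        return None
    return Extraction(match.group("arg1"), "CapitalOf", match.group("arg2"))


if __name__ == "__main__":
    print(extract_capital_of("Paris is the stylish capital of France"))
```

A rule this narrow recognizes only one phrasing of one relation, and a sentence that expresses the same fact differently yields no tuple at all. That brittleness is why the knowledge an extractor relies on, and how it is obtained, matters so much.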
Considerable knowledge is necessary to accurately extract these tuples from a broad range of text. Existing techniques obtain it in ways ranging from direct knowledge-based encoding (a human enters regular expressions or rules) to supervised learning (a human provides labeled training examples) to self-supervised learning (the system automatically finds and labels its own examples). Here, we briefly survey these methods.
Knowledge-Based Methods. The first IE systems were domain-specific. A series of DARPA Message Understanding Conferences (MUCs) challenged the NLP community to build systems that handled robust extraction from naturally occurring text. The domain