KnowItNow: Fast, Scalable Information Extraction from the Web Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni The Problem Numerous NLP applications rely on searchengine queries to: – – Extract information from the web. Compute statistics over the Web corpus. Search engines are extremely helpful for several linguistic tasks such as: – – Computing usage statistics. Finding a subset of web documents to analyze in depth. Problem With Search Engines Search engines were not designed as building blocks for NLP applications. As a result: – An NLP application is forced to issue literally millions of queries to search engines; increasing processing time and limiting scalability. – Fetching web documents is also time-consuming. – Search engines are limiting the use of programmatic queries to their engines Google has placed hard quotas on the number of daily queries a program can issue. Other engines force applications to introduce “courtesy waits” between queries. Example of the Problem “KnowItAll” KnowItAll works in a generate-and-test architecture extracting Information in 2 stages: – – First, it Utilizes a small set of domain independent extraction patterns to generate candidate facts. Second, it automatically tests the plausibility of the candidate facts it extracts using pointwise mutual information (PMI) statistics computed from search-engine hit counts. 1st Stage in KnowItAll Take the generic pattern “NP1 such as NPList2”. This indicates that the head of each simple noun phrase (NP) in NPList2 is a member of the class named in NP1. – – Take as example the pattern for class City, and the sentence “We provide tours to cities such as Paris, London, and Berlin.” KNOWITALL extracts three candidate cities from the sentence: Paris, London, Berlin. 2nd Stage in KnowItAll KnowItAll needs to assess the likelihood of the information it found. Verify that Paris is actually a city. It does that by computing the PMI between Paris and a set of k discriminator phrases that tend to have high mutual information with city names. (Paris is a city) This requires at least k search-engine queries for every candidate extraction! The Solution A novel architecture for Information Extraction which does not depend on Web search-engine queries; KnowItNow. Works over 2 stages like KnowItAll: – – Uses a specialized search engine called the Binding Engine (or BE) which efficiently returns bindings in response to variabilized queries. Uses URNS, a combinatorial model, which estimates the probability that each extraction is correct without using any additional search engine queries The Binding Engine vs. The Traditional Engine The Traditional Engine Take the search query (“Cities such as <NounPhrase>”). Perform a traditional search engine query. For each such URL: – – – – obtain the document contents. find the searched-for terms in the document text. Run the noun phrase recognizer to determine if text found satisfies the linguistic type requirement If it does, return the string. Problems With Traditional Engine The search itself doesn’t take a long time. Even if there are multiple search queries The second stage fetches a large number of documents, each fetch likely resulting in a random disk seek; this stage executes slowly. this disk access is slow regardless of whether it happens on a locally-cached copy or on a remote document server. The Binding Engine Why not use a table to store a list of terms and documents containing them?! The Binding Engine supports these queries: – – – Typed variables (such as NounPhrase) String-processing functions (such as “head(X)” or “ProperNoun(X)”). Standard query terms. It processes a variable by returning every possible string in the corpus that has a matching type, and that can be substituted for the variable and still satisfy the user's query. How the Binding Engine Works? It uses a novel approach called the “neighborhood index” The neighborhood index is an augmented inverted index structure. – – For each term in the corpus, the index keeps a list of documents in which the term appears and a list of positions where the term occurs. The index also keeps a list of left-hand and righthand neighbors at each position. (Adjacent text strings that satisfy a recognizer, e.g. NounPhrase) How is The Binding Engine Better? K is the number of concrete terms in the query. B is the number of variable bindings found in the corpus. N is the number of documents in the corpus. Expensive processing such as part-of-speech tagging or shallow syntactic parsing is performed only once, while building the index, and is not needed at query time. How is The Binding Engine Better? Average time to return the relevant bindings in response to a set of queries. 0.06 CPU minutes for BE. 8.16 CPU minutes for Nutch (Private search engine) Disadvantages of The Binding Engine It consumes a large amount of disk space, as parts of the corpus text are folded into the index several times. The neighborhood index increased disk space four times that of a standard inverted index The URNS Model We need a way to test that the extractions from the Binding Engine are correct KnowItAll issues queries to search engines and uses the PMI model to verify extractions. PMI is very efficient but it is also very slow. How URNS works? URNS is a probabilistic model – It takes the form of a classic “balls-andurns” model from combinatorics. Each extraction is modeled as a labeled ball in an urn. A label represents either an instance of the target class or relation, or represents an error How URNS works? C - the set of unique target labels; |C| is the number of unique target labels in the urn. E - the set of unique error labels; |E| is the number of unique error labels in the urn. num(b) - the function giving the number of balls labeled by b where b is a subset of C U E. num(B) is the multi-set giving the number of balls for each label b, where b is a subset of B. How URNS works? The goal of an IE system is to discern which of the labels it extracts are in fact elements of C. – Given that a particular label x was extracted k times in a set of n draws from the urn, what is the probability that x is a subset of C? Alternative to URNS Items that were extracted more often are more likely to be true. – i.e. Extractions with higher frequencies are true. Experiments Recall: how many distinct extractions does each system return at high precision? Time: how long did each system take to produce and rank its extractions? Extraction Rate: how many distinct high quality extractions does the system return per minute? The extraction rate is simply recall divided by time. KnowItNow vs. KnowItAll Tested on relation “Country” KnowItNow vs. KnowItAll Tested on relation “CapitalOf” KnowItNow vs. KnowItAll Tested on relation “Corp” KnowItNow vs. KnowItAll Tested on relation “CeoOf” KnowItNow vs. KnowItAll Contributions A novel architecture for Information Extraction which does not depend on Web search-engine queries. Extract tens of thousands of facts from the Web in minutes instead of days. KnowItNow's extraction rate is two to three orders of magnitude greater than KnowItAll's.