Distance functions and IE – 5
William W. Cohen, CALD

Announcements
• Current statistics:
  – days with unscheduled student talks: 5
  – students with unscheduled student talks: 3
  – Projects are due 4/28 (last day of class)
  – Additional requirement: a draft (for comments) no later than 4/21

String distance metrics so far...
• Term-based (e.g., TF-IDF as in WHIRL)
  – Distance depends on the set of words contained in both s and t, so it is sensitive to spelling errors
  – Usually weight words to account for "importance"
  – Fast comparison: O(n log n) for |s| + |t| = n
  – (a toy TF-IDF similarity sketch appears after these slides)
• Edit-distance metrics
  – Distance is the shortest sequence of edit commands that transforms s into t
  – No notion of word importance
  – More expensive: O(n^2)
• Other metrics
  – Jaro metric and variants
  – Monge-Elkan's recursive string matching
  – etc.?
• Which metrics work best, for which problems?

Results - Overall
[results chart not preserved]

Combining Information Extraction and Similarity Computations
Krauthammer et al.

Background
• Common task in proteomics/genomics:
  – look for (soft) matches to a query sequence in a large "database" of sequences
  – want to find subsequences (genes) that are highly similar (and hence probably related)
  – want to ignore "accidental" matches
  – a possible technique is Smith-Waterman (local alignment)
• Want a char-char "reward" for alignment that reflects confidence that the alignment is not due to chance

Smith-Waterman distance
(a Python sketch of the recurrence appears after these slides)

      c  o  h  e  n  d  o  r  f
  m   0  0  0  0  0  0  0  0  0
  c   1  0  0  0  0  0  0  0  0
  c   0  0  0  0  0  0  0  0  0
  o   0  2  1  0  0  0  2  1  0
  h   0  1  4  3  2  1  1  1  0
  n   0  0  3  3  5  4  3  2  1
  s   0  0  2  2  4  4  3  2  1
  k   0  0  1  1  3  3  3  2  1
  i   0  0  0  0  2  2  2  2  1

dist = 5. In general, "peaks" in the matrix scores indicate highly similar substrings.

Background (cont.)
• The char-char "reward" is based on substitutability theory/statistics for amino acids
• Smith-Waterman doesn't scale well
• BLAST and FASTA: fast approximate S-W

BLAST/FASTA ideas
• Find all char n-grams ("words") in the query string
• FASTA:
  – Use inverted indices to find where these words appear in the DB sequence
  – Run S-W only near DB sections that contain some of these words
  – (see the n-gram filter sketch after these slides)
• BLAST:
  – Generate variations of these words by looking for changes that would lead to strong similarities
  – Discard "low-IDF" words (where accidental matches are likely)
  – Use the expanded set of n-grams to focus the search
  – [diagram: the query string mapped to its words and their expansions]
• The BLAST program:
  – Widely used
  – Fast implementation
  – Supports asking multiple queries against a database at once
• Can one use it to find soft matches of protein names (from a dictionary) in text?
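The three sketches below ground the main algorithmic ideas from these slides. First, a minimal Python sketch of a WHIRL-style term-based similarity: bag-of-words TF-IDF vectors compared by cosine. The IDF weights and the helper name `tfidf_cosine` are illustrative assumptions, not WHIRL's actual code.

```python
import math
from collections import Counter

def tfidf_cosine(s, t, idf):
    """WHIRL-style term-based similarity: TF-IDF weighted bag-of-words
    vectors for s and t, compared by cosine."""
    def vec(text):
        tf = Counter(text.lower().split())
        return {w: tf[w] * idf.get(w, 0.0) for w in tf}
    vs, vt = vec(s), vec(t)
    dot = sum(vs[w] * vt.get(w, 0.0) for w in vs)
    norm = (math.sqrt(sum(v * v for v in vs.values()))
            * math.sqrt(sum(v * v for v in vt.values())))
    return dot / norm if norm else 0.0

# Made-up IDF weights: rare words like "cohen" count more than "dr".
idf = {"cohen": 3.2, "william": 2.1, "dr": 0.4}
print(tfidf_cosine("Dr William Cohen", "William Cohen"))
```

Note how a misspelling like "Willaim" would get zero credit here, which is exactly the sensitivity to spelling errors the slide points out.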
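Next, a minimal sketch of the Smith-Waterman recurrence behind the matrix above, assuming simple unit match/mismatch/gap scores; the slide's matrix uses its own char-char rewards, so the numbers it produces will differ.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment. M[i][j] is the best score of an
    alignment ending at s[i-1], t[j-1]; cells are floored at 0, so
    'peaks' in M mark highly similar substrings."""
    M = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            reward = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + reward,  # substitute/match
                          M[i - 1][j] + gap,         # delete from s
                          M[i][j - 1] + gap)         # insert into s
            best = max(best, M[i][j])
    return best

print(smith_waterman("mccohnski", "cohendorf"))  # score of best local match
```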
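Finally, a toy sketch of the FASTA-style filtering idea: an inverted index of character n-grams picks out the database regions worth aligning. The window padding and function names are assumptions; real FASTA/BLAST add banded alignment, word expansion, and low-IDF filtering.

```python
from collections import defaultdict

def ngrams(s, n=3):
    """All character n-grams ('words') of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(db, n=3):
    """Inverted index: n-gram -> positions where it occurs in db."""
    index = defaultdict(list)
    for i in range(len(db) - n + 1):
        index[db[i:i + n]].append(i)
    return index

def candidate_regions(query, index, n=3, pad=10):
    """FASTA-style filter: only run Smith-Waterman near positions that
    share an n-gram with the query."""
    hits = sorted(p for g in ngrams(query, n) for p in index.get(g, []))
    return [(max(0, p - pad), p + n + pad) for p in hits]

db = "xxcohendorfxx"
index = build_index(db)
print(candidate_regions("mccohnski", index))  # regions sharing e.g. "coh"
```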
Basic idea
The system treats dictionary-based extraction as a BLAST search. The two problems correspond as follows:

  IE problem                                     BLAST problem
  biomedical paper                               protein database
  protein name dictionary                        query strings
  extracted protein name (dict. entry -> text)   proposed alignment (query -> database)
  IE system: dictionaries + BLAST (optimized)    query algorithm: BLAST

1) Mapping text to DNA sequences
• (Q: what sort of char-char similarity is this?)

2) Optimizing BLAST
• Split the protein-name database into several parts (for short, medium-length, and long protein names)
  – Scoring depends on the length of the matched string
• Require space characters before and after "short" protein names
• Manually search (grid search?) for better settings of certain key parameters for each protein-name subdatabase
  – With what data?
• Evaluate on one review article, 1162 protein names
  – inter-annotator agreement not great (70-85%)

[two further "Optimizing BLAST" slides: parameter details not preserved]

Results
• Overall: precision 71.1%, recall 78.8% (optimized)

IE with Dictionaries
Cohen & Sarawagi

Finding names you know about
• Problem: given a dictionary of names, find them in email text
  – An important task beyond email (biology, link analysis, ...)
  – Exact match is unlikely to work perfectly, due to nicknames (Will Cohen), abbreviations (William C.), misspellings (Willaim Chen), polysemous words (June, Bill), etc.; in informal text it sometimes works very poorly
  – The problem is similar to the record linkage problem (aka data cleaning, de-duping, merge/purge, ...) of finding duplicate records in heterogeneous databases
• It is hard to combine state-of-the-art similarity metrics (as used in record linkage) with a state-of-the-art NER system, due to a representational mismatch:
  – opening up the box, modern NER systems don't really know anything about names...

IE as Sequential Word Classification
• A trained IE system models the relative probability of labeled sequences of words, with states such as person name, location name, and background
• To classify, find the most likely state sequence for the given words (see the Viterbi sketch after these slides):
      Yesterday Pedro Domingos spoke this example sentence.
• Any words generated by the designated "person name" state are extracted as a person name:
      Person name: Pedro Domingos
• Modern IE systems use a rich representation for words, and clever probabilistic models of how labels interact in a sequence, but they do not explicitly represent the names extracted
  – [diagram: labels O_{t-1}, O_t, O_{t+1} over words w_{t-1}, w_t, w_{t+1}, annotated with example word features: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is "Wisniewski"; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in a hyperlink anchor; last person name was female; next two words are "and Associates"]

Semi-Markov models for IE
with Sunita Sarawagi, IIT Bombay
• Train on sequences of labeled segments, not labeled words
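As referenced above, a toy Viterbi sketch for the person-name example. Every probability here is invented for illustration, and the tiny name list stands in for the rich learned features a real tagger uses.

```python
import math

# Toy first-order model over two states; a trained IE system would
# learn these transition and emission scores from data.
states = ["background", "person"]
log_trans = {("background", "background"): math.log(0.8),
             ("background", "person"): math.log(0.2),
             ("person", "person"): math.log(0.6),
             ("person", "background"): math.log(0.4)}

def log_emit(state, word):
    # Stand-in for learned emissions: a tiny list of known name tokens.
    namelike = word in {"Pedro", "Domingos"}
    return math.log(0.8 if (state == "person") == namelike else 0.2)

def viterbi(words):
    """Return the most likely state sequence for the given words."""
    V = [{s: log_emit(s, words[0]) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[(p, s)])
            row[s] = V[-1][prev] + log_trans[(prev, s)] + log_emit(s, w)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    seq = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

words = "Yesterday Pedro Domingos spoke this example sentence".split()
print([w for w, s in zip(words, viterbi(words)) if s == "person"])
# -> ['Pedro', 'Domingos']
```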
• A segment is S = (start, end, label)
• Build a probability model of segment sequences, not word sequences
• Define features f of segments, e.g.
      f(S) = words x_t ... x_u, length, previous words, case information, ..., distance to a known name
• (Approximately) optimize the feature weights on the m training examples:
      maximize  sum_{i=1..m} log Pr(S_i | x_i)

Details: Semi-Markov model
[model equations not preserved]

Conditional Semi-Markov models
• CMM vs. CSMM [equations not preserved]

A training algorithm for CSMMs (1)
• Review: Collins' perceptron training algorithm, which repeatedly compares the correct tags with the Viterbi tags (a sketch of the training loop appears after these slides)

A training algorithm for CSMMs (2)
• Variant of Collins' perceptron training algorithm: a voted perceptron learner for T_TRANS, with a Viterbi-like decoder

A training algorithm for CSMMs (3)
• The same voted perceptron learner for T_SEGTRANS, with a Viterbi-like decoder over segments (a segment-level Viterbi sketch appears after these slides)

Sample CSMM features
[feature examples not preserved]

Experimental results
• Baseline algorithms:
  – HMM-VP/1: tags are "in entity", "other"
  – HMM-VP/4: tags are "begin entity", "end entity", "continue entity", "unique", "other"
  – SMM-VP: every feature f(w) has versions for "f(w) is true for some w that is the first (last, any) word of the segment"
• Dictionaries: used as in Borthwick
  – HMM-VP/1: f_D(w) = "word w is in D"
  – HMM-VP/4: f_{D,begin}(w) = "word w begins an entity in D", etc.
  – Dictionary lookup is also used as a baseline

Datasets used
• Small training sets (10% of the available data) were used in the experiments

Results
[charts not preserved: overall; varying history; changing the dictionary; vs. CRF]
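As referenced above, a minimal sketch of segment-level Viterbi decoding under an assumed maximum segment length; the `score` function is a stand-in for the learned w . f(S). Note how a segment feature like "matches a dictionary entry" is natural here but awkward to express at the word level.

```python
def semi_markov_viterbi(x, labels, score, max_len=4):
    """Best labeled segmentation of x. score(y, prev, start, end) is a
    stand-in for the learned w . f(S) for segment S = (start, end, y)."""
    n = len(x)
    best = [{y: (0.0, None) for y in labels}]
    best += [{y: (float("-inf"), None) for y in labels} for _ in range(n)]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for y in labels:
                for prev in labels:
                    v = best[start][prev][0] + score(y, prev, start, end)
                    if v > best[end][y][0]:
                        best[end][y] = (v, (start, prev))
    y = max(labels, key=lambda l: best[n][l][0])
    end, segments = n, []
    while end > 0:  # trace back through the chosen segments
        start, prev = best[end][y][1]
        segments.append((start, end, y))
        end, y = start, prev
    return list(reversed(segments))

# Toy usage: a dictionary-match feature defined on whole segments.
dictionary = {"pedro domingos"}
words = "yesterday pedro domingos spoke".split()

def score(y, prev, start, end):
    seg = " ".join(words[start:end])
    if y == "name":
        return 2.0 if seg in dictionary else -2.0
    return 0.5 if end - start == 1 else -1.0  # background prefers single words

print(semi_markov_viterbi(words, ["other", "name"], score))
# -> [(0, 1, 'other'), (1, 3, 'name'), (3, 4, 'other')]
```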
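And a hedged sketch of the Collins-style perceptron training loop reviewed on the CSMM slides. Here `decode` and `features` are assumed helper callbacks (word-level Viterbi for a CMM, the segment-level Viterbi above for a CSMM), and weight averaging stands in for the exact voted variant.

```python
from collections import defaultdict

def perceptron_train(examples, decode, features, epochs=5):
    """Sketch of Collins-style structured perceptron training.
    decode(x, w): best label sequence for x under weights w (Viterbi).
    features(x, y): dict of feature counts for labeling y of x."""
    w = defaultdict(float)
    total = defaultdict(float)
    n_steps = 0
    for _ in range(epochs):
        for x, y_true in examples:
            y_hat = decode(x, w)      # current model's Viterbi tags
            if y_hat != y_true:       # compare against the correct tags
                for f, v in features(x, y_true).items():
                    w[f] += v         # promote features of the correct tags
                for f, v in features(x, y_hat).items():
                    w[f] -= v         # demote features of the Viterbi tags
            n_steps += 1
            for f, v in w.items():    # accumulate for averaging
                total[f] += v
    return {f: v / n_steps for f, v in total.items()}
```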