Selecting Relevant Documents
• Assume:
– we already have a corpus of documents defined.
– goal is to return a subset of those documents.
– Individual documents have been separated into
individual files
• Remaining components must parse, index,
find, and rank documents.
• Traditional approach is based on the words
in the documents
Extracting Lexical Features
• Process a string of characters
– assemble characters into tokens
– choose tokens to index
• In place (problem for www)
• Standard lexical analysis problem
• Commonly built with a lexical analyser generator, such as lex
Lexical Analyser
• Basic idea is a finite state machine
• Triples of input state, transition token,
output state
[Diagram: finite state machine with states 0, 1, and 2; A-Z characters move from state 0 into state 1 and loop there, while a blank or EOF ends the token (state 2); a blank in state 0 stays in state 0]
• Must be very efficient; gets used a LOT
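A minimal sketch of such a token-assembling state machine in Python; the state numbering follows the diagram, but the function name and details are illustrative, not from the text:

    def tokenize_fsm(text):
        """Assemble alphabetic characters into tokens with a tiny state machine.
        State 0: between tokens; state 1: inside a token; a blank/EOF emits the token."""
        tokens, current, state = [], [], 0
        for ch in text + " ":              # trailing blank plays the role of EOF
            if ch.isalpha():
                state = 1                  # transition 0 -> 1 or 1 -> 1 on a letter
                current.append(ch)
            elif state == 1:               # blank ends the current token (state 2)
                tokens.append("".join(current))
                current, state = [], 0
        return tokens

    # tokenize_fsm("Selecting relevant documents") -> ['Selecting', 'relevant', 'documents']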
Design Issues for Lexical Analyser
• Punctuation
– treat as whitespace?
– treat as characters?
– treat specially?
• Case
– fold?
• Digits
– assemble into numbers?
– treat as characters?
– treat as punctuation?
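These design decisions can be exposed as options on the tokenizer. A hedged Python sketch; the parameter names and defaults are my own, only meant to illustrate the choices above:

    import re

    def tokenize(text, fold_case=True, keep_digits=True, punct_is_space=True):
        """Tokenizer whose behaviour reflects the design decisions listed above."""
        if punct_is_space:
            text = re.sub(r"[^\w\s]", " ", text)    # punctuation treated as whitespace
        if fold_case:
            text = text.lower()                     # case folding
        pattern = r"[a-zA-Z0-9]+" if keep_digits else r"[a-zA-Z]+"
        return re.findall(pattern, text)

    # tokenize("Re-indexing 42 documents!") -> ['re', 'indexing', '42', 'documents']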
Lexical Analyser
• Output of lexical analyser is a string of
tokens
• Remaining operations are all on these
tokens
• We have already thrown away some
information; this makes processing more
efficient, but limits the power of our search
Stemming
• Additional processing at the token level
• Turn words into a canonical form:
– “cars” into “car”
– “children” into “child”
– “walked” into “walk”
• Decreases the total number of different
tokens to be processed
• Decreases the precision of a search, but
increases its recall
Stemming -- How?
• Plurals to singulars (eg, children to child)
• Verbs to infinitives (eg, talked to talk)
• Clearly non-trivial in English!
• Typical stemmers use a context-sensitive
transformation grammar:
– (.*)SSES -> \1SS
• 50-250 rules are typical
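A toy illustration of such a transformation grammar in Python; the handful of rules below (including the SSES -> SS rule above) is purely illustrative, and real stemmers use many more rules and conditions:

    import re

    # A few illustrative suffix rules, applied in order; first match wins.
    RULES = [
        (r"sses$", "ss"),    # caresses -> caress   ((.*)SSES -> \1SS)
        (r"ies$",  "i"),     # ponies   -> poni
        (r"ing$",  ""),      # walking  -> walk
        (r"ed$",   ""),      # walked   -> walk
        (r"s$",    ""),      # cars     -> car
    ]

    def stem(token):
        for pattern, replacement in RULES:
            if re.search(pattern, token):
                return re.sub(pattern, replacement, token)
        return token

    # [stem(w) for w in ["cars", "caresses", "walked", "children"]]
    # -> ['car', 'caress', 'walk', 'children']   (irregular forms need their own rules)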
Noise Words (Stop Words)
• Function words that contribute little or
nothing to meaning
• Very frequent words
– If a word occurs in every document, it is not
useful in choosing among documents
– However, need to be careful, because this is
corpus-dependent
• Often implemented as a discrete list (stop.wrd on CD)
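A sketch of one possible implementation: a fixed stop list plus a corpus-dependent check for words that occur in (nearly) every document. The word list and threshold below are illustrative, not taken from the text:

    from collections import Counter

    # Small illustrative stop list; a real one (such as the text's stop.wrd) is far longer.
    STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}

    def corpus_stop_words(docs, threshold=0.99):
        """Corpus-dependent noise words: terms appearing in nearly every document."""
        doc_freq = Counter(term for doc in docs for term in set(doc))
        return {term for term, df in doc_freq.items() if df >= threshold * len(docs)}

    def remove_noise(tokens, extra_stop_words=frozenset()):
        """Drop fixed stop words plus any corpus-dependent ones."""
        return [t for t in tokens if t not in STOP_WORDS and t not in extra_stop_words]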
Example Corpora
• We are assuming a fixed corpus. Text uses
two sample corpora:
– AI Abstracts
– Email. Anyone’s email.
• Textual fields, structured attributes
• Textual: free, unformatted, no meta-information
• Structured: additional information beyond
the content
Structured Attributes for AI Theses
• Thesis #
• Author
• Year
• University
• Advisor
• Language
Textual Fields for AIT
• Abstract
– Reasonably complete standard academic
English capturing the basic meaning of the
document
• Title
– Short, formalized, captures most critical part of
meaning
– (proxy for abstract)
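One way to picture the combination of structured attributes and textual fields is as a simple record; a Python sketch, with field names following the two slides above:

    from dataclasses import dataclass

    @dataclass
    class ThesisRecord:
        # structured attributes
        thesis_id: str
        author: str
        year: int
        university: str
        advisor: str
        language: str
        # textual fields
        title: str
        abstract: str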
Indexing
• We have a tokenized, stemmed sequence of
words
• Next step is to parse document, extracting
index terms
– Assume that each token is a word and we don’t
want to recognize any more complex structures
than single words.
• When all documents are processed, create
index
Basic Indexing Algorithm
• For each document in the corpus
– get the next token
– save the posting in a list
• doc ID, frequency
• For each token found in the corpus
– calculate #doc, total frequency
– sort by frequency
(p 53-54)
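A hedged Python sketch of this basic indexing algorithm, with postings stored as (doc ID, frequency) pairs and per-token statistics sorted by frequency; names are illustrative:

    from collections import Counter, defaultdict

    def build_index(corpus):
        """corpus maps doc_id -> list of (tokenized, stemmed, filtered) terms.
        Returns token -> postings list, plus per-token statistics."""
        index = defaultdict(list)
        for doc_id, tokens in corpus.items():
            for token, freq in Counter(tokens).items():
                index[token].append((doc_id, freq))       # posting: doc ID, frequency
        # per token: number of documents (#doc) and total frequency, sorted by frequency
        stats = sorted(((tok, len(posts), sum(f for _, f in posts))
                        for tok, posts in index.items()),
                       key=lambda row: row[2], reverse=True)
        return index, stats

    # build_index({"d1": ["car", "engine", "car"], "d2": ["engine", "child"]})
    # -> index["car"] == [("d1", 2)]; stats rows are (token, #doc, total frequency)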
Fine Points
• Dynamic Corpora
• Higher-resolution data (eg, char position)
• Giving extra weight to proxy text (typically
by doubling or tripling the frequency count;
see the sketch after this list)
• Document-type-specific processing
– In HTML, want to ignore tags
– In email, maybe want to ignore quoted material
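For the proxy-text point above, the extra weight can be as simple as repeating the proxy tokens before counting; a sketch (the weight of 3 is only an example):

    def weighted_tokens(title_tokens, body_tokens, title_weight=3):
        """Repeat proxy-text (e.g. title) tokens so they count extra when indexed."""
        return title_tokens * title_weight + body_tokens

    # weighted_tokens(["search"], ["engine", "search"]) counts "search" four times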
Choosing Keywords
• Don’t necessarily want to index on every word
– Takes more space for index
– Takes more processing time
– May not improve our resolving power
• How do we choose keywords?
– Manually
– Statistically
• Exhaustivity vs specificity
Manually Choosing Keywords
• Unconstrained vocabulary: allow creator of
document to choose whatever he/she wants
– “best” match
– captures new terms easily
– easiest for person choosing keywords
• Constrained vocabulary: hand-crafted ontologies
– can include hierarchical and other relations
– more consistent
– easier for searching; possible “magic bullet” search
Examples of Constrained
Vocabularies
• ACM headings
• Medline Subject Headings
Automated Vocabulary Selection
• Frequency: Zipf’s Law
– Within one corpus, words with middle
frequencies are typically “best” (see the
sketch after this list)
• Document-oriented representation bias: lots
of keywords/document
• Query-Oriented representation bias: only
the “most typical” words. Assumes that we
are comparing across documents.
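A sketch of the frequency-based (Zipf-style) selection: rank words by corpus frequency and keep a middle band. The cutoffs below are arbitrary illustrations, not values from the text:

    from collections import Counter

    def middle_band_keywords(corpus_tokens, drop_top=0.2, drop_bottom=0.2):
        """Rank words by frequency, then drop the most and least frequent fractions."""
        ranked = Counter(corpus_tokens).most_common()     # sorted high -> low frequency
        n = len(ranked)
        lo, hi = int(n * drop_top), int(n * (1 - drop_bottom))
        return [word for word, _ in ranked[lo:hi]]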
Choosing Keywords
• “Best” depends on actual use; if a word
only occurs in one document, it may be very
good for retrieving that document, but not
very effective overall.
• Words which have no resolving power
within a corpus may be best choices across
corpora
• Not very important for web searching; will
be more relevant for text mining.
Keyword Choice for WWW
• We don’t have a fixed corpus of documents
• New terms appear fairly regularly, and are
likely to be common search terms
• Queries that people want to make are wide-ranging and unpredictable
• Therefore: can’t limit keywords, except
possibly to eliminate stop words.
• Even stop words are language-dependent.
So determine language first.
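One simple, hedged way to act on that last point is to guess the language by stop-word overlap before choosing a stop list; the tiny word lists below are illustrative only:

    # Tiny illustrative stop lists per language; real lists are much longer.
    STOP_LISTS = {
        "en": {"the", "and", "of", "to", "is"},
        "fr": {"le", "la", "et", "de", "est"},
        "de": {"der", "die", "und", "von", "ist"},
    }

    def guess_language(tokens):
        """Pick the language whose stop list overlaps the token set most."""
        tokens = set(tokens)
        return max(STOP_LISTS, key=lambda lang: len(STOP_LISTS[lang] & tokens))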