Information Retrieval
CSE 8337 Spring 2003
Introduction/Overview

Material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
  http://www.sims.berkeley.edu/~hearst/irbook/
- Data Mining Introductory and Advanced Topics by Margaret H. Dunham
  http://www.engr.smu.edu/~mhd/book

Motivation
- IR: representation, storage, organization of, and access to information items
- Focus is on the user information need
- User information need: Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
- Emphasis is on the retrieval of information (not data)

DB vs IR
- Records (tuples) vs. documents
- Well defined results vs. fuzzy results
- DB grew out of files and traditional business systems
- IR grew out of library science and the need to categorize/group/access books/articles

DB vs IR (cont’d)
- Data retrieval
  - which docs contain a set of keywords?
  - well defined semantics
  - a single erroneous object implies failure!
- Information retrieval
  - information about a subject or topic
  - semantics is frequently loose
  - small errors are tolerated
- IR system:
  - interpret contents of information items
  - generate a ranking which reflects relevance
  - notion of relevance is most important

Motivation
- IR in the last 20 years:
  - classification and categorization
  - systems and languages
  - user interfaces and visualization
- Still, area was seen as of narrow interest
- Advent of the Web changed this perception once and for all
  - universal repository of knowledge
  - free (low cost) universal access
  - no central editorial board
  - many problems though: IR seen as key to finding the solutions!

Basic Concepts
- The User Task
  - Retrieval: information or data; purposeful
  - Browsing: glancing around (cars, Le Mans, France, tourism)
[Figure: the user task. Labels: Retrieval, Browsing, Database]

Basic Concepts
- Logical view of the documents
[Figure: logical view of the documents. Labels: Docs; accents, spacing; stopwords; noun groups; stemming; manual indexing; structure; full text; index terms]
- Document representation viewed as a continuum: logical view of docs might shift

The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, DB Manager Module, Searching, Index (inverted file), Text Database, Ranking. Flows: user need, logical view, query, user feedback, retrieved docs, ranked docs]

Fuzzy Sets and Logic
- Fuzzy Set: Set membership function is a real valued function with output in the range [0,1].
- f(x): Probability x is in F.
- 1-f(x): Probability x is not in F.
- EX: T = {x | x is a person and x is tall}
  - Let f(x) be the probability that x is tall
  - Here f is the membership function

Fuzzy Sets
[Figure: Fuzzy Sets]

IR is Fuzzy
[Figure: two panels, Simple and Fuzzy, each with Accept and Reject regions]

Information Retrieval
- Information Retrieval (IR): retrieving desired information from textual data.
- Library Science
- Digital Libraries
- Web Search Engines
- Traditionally keyword based
- Sample query: Find all documents about “data mining”.

Information Retrieval
- Similarity: measure of how close a query is to a document.
- Documents which are “close enough” are retrieved.
- Metrics:
  - Precision = |Relevant and Retrieved| / |Retrieved|
  - Recall = |Relevant and Retrieved| / |Relevant|

IR Query Result Measures
[Figure: IR query result measures. Label: IR]
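
The precision and recall metrics defined in these slides can be illustrated with a minimal Python sketch. The function name precision_recall, the document ids, and the choice to return 0.0 when a denominator is empty are assumptions made for this example; they do not come from the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # the "Relevant and Retrieved" set

    # Assumed convention (not from the slides): an empty denominator yields 0.0.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids for the sample query about "data mining".
relevant = {"d1", "d2", "d3", "d4"}   # documents judged relevant
retrieved = {"d2", "d3", "d5"}        # documents the system returned

precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 2/4). Retrieving more documents can raise recall while lowering precision, which is why the two are reported together.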