Information Retrieval
CSE 8337
Spring 2003
Introduction/Overview
Material for these slides obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
Motivation



IR: representation, storage, organization of, and access to information items
Focus is on the user information need
User information need:
Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
Emphasis is on the retrieval of information (not data)
DB vs IR




Records (tuples) vs. documents
Well defined results vs. fuzzy results
DB grew out of files and traditional business systems
IR grew out of library science and the need to categorize/group/access books/articles
DB vs IR (cont’d)
Data retrieval
which docs contain a set of keywords?
Well defined semantics
a single erroneous object implies failure!
Information retrieval
information about a subject or topic
semantics is frequently loose
small errors are tolerated
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
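The contrast between data retrieval and information retrieval above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy collection, the function names `data_retrieval` and `ranked_retrieval`, and the overlap-count score are hypothetical choices, not something the slides prescribe.

```python
# Toy collection; the documents and ids are made up for illustration.
DOCS = {
    "d1": "ncaa tennis tournament results for university teams",
    "d2": "college tennis team roster",
    "d3": "university library opening hours",
}


def data_retrieval(keywords, docs):
    """Exact semantics: return ids of docs containing every keyword."""
    wanted = set(keywords)
    return sorted(d for d, text in docs.items() if wanted <= set(text.split()))


def ranked_retrieval(keywords, docs):
    """Loose semantics: rank docs by how many query keywords they contain.

    Overlap count is an assumed stand-in for a real relevance score.
    """
    wanted = set(keywords)
    scored = [(len(wanted & set(text.split())), d) for d, text in docs.items()]
    # Highest score first; ties broken by doc id for a stable order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in ranked if score > 0]
```

With data retrieval a document either satisfies the query or it does not; with ranked retrieval a partial match still appears, only lower in the list.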
Motivation
IR in the last 20 years:
classification and categorization
systems and languages
user interfaces and visualization
Still, the area was seen as being of narrow interest
Advent of the Web changed this perception once and for all
universal repository of knowledge
free (low cost) universal access
no central editorial board
many problems though: IR seen as key to finding the solutions!
Basic Concepts
The User Task
[Figure: user tasks on a database: Retrieval and Browsing]
Retrieval
information or data
purposeful
Browsing
glancing around
cars, Le Mans, France, tourism
Basic Concepts
Logical view of the documents
[Figure: text operations applied to docs (accents and spacing, stopwords, noun groups, stemming, manual indexing) and structure recognition, taking the logical view from full text to index terms]
Document representation viewed as a continuum: logical view of docs might shift
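The text operations named on this slide (accents, spacing, stopwords, stemming) can be sketched as a small pipeline that moves a document from its full text toward a set of index terms. This is only an illustration: the stopword list is a tiny assumed sample, `naive_stem` is a crude suffix stripper rather than a real stemming algorithm, and all function names are hypothetical. Noun groups and manual indexing are not covered.

```python
import unicodedata

# A tiny stopword list, assumed for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def strip_accents(text):
    """Remove accents by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(word):
    """Crude suffix stripping; a placeholder, not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce full text to a list of index terms."""
    normalized = strip_accents(text).lower()
    # Keep letters and digits, turn everything else into spacing.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in normalized)
    tokens = cleaned.split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```

Each step discards some information, which is one way to read the continuum: stopping after `strip_accents` stays close to full text, while the output of `index_terms` is a much more compact logical view.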
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flow labels: user need, logical view, user feedback, query, retrieved docs, ranked docs.]
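The indexing, searching and ranking stages of the retrieval process can be sketched with a toy inverted file. This is a minimal sketch, assuming whitespace tokenization and a match-count score; the function names and the scoring rule are hypothetical and stand in for whatever a real system would use.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Indexing: map each term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Searching: return ids of docs that contain at least one query term."""
    hits = set()
    for term in query.lower().split():
        hits.update(index.get(term, []))
    return hits


def rank(index, query, hits):
    """Ranking: order retrieved docs by number of matching query terms.

    The match count is an assumed score, used here only for illustration.
    """
    terms = set(query.lower().split())

    def score(doc_id):
        return sum(1 for t in terms if doc_id in index.get(t, []))

    # Highest score first; ties broken by doc id for a stable order.
    return sorted(hits, key=lambda d: (-score(d), d))
```

A query is answered by looking up its terms in the index rather than scanning the text database, and the retrieved docs are then handed to the ranking step.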
Fuzzy Sets and Logic




Fuzzy Set: Set membership function is a real-valued function with output in the range [0,1].
f(x): Probability x is in F.
1-f(x): Probability x is not in F.
EX:
T = {x | x is a person and x is tall}
Let f(x) be the probability that x is tall
Here f is the membership function
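The tall-person example can be made concrete with a membership function that returns a value in [0,1]. This is a sketch under assumptions: the slides do not say how f is defined, so the linear ramp and the 160 cm and 190 cm breakpoints below are invented for illustration, as are the function names.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Membership f(x) of a person in the fuzzy set T of tall people.

    The 160 cm and 190 cm breakpoints are assumed values for illustration.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    # Linear ramp between the two breakpoints.
    return (height_cm - low) / (high - low)


def not_tall_membership(height_cm):
    """Complement 1 - f(x): membership in the set of people who are not tall."""
    return 1.0 - tall_membership(height_cm)
```

Under these assumed breakpoints a person of 175 cm has membership 0.5 in T and 0.5 in its complement, whereas a crisp set would force a yes-or-no answer.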
Fuzzy Sets
IR is Fuzzy
[Figure: Accept and Reject regions under Simple matching versus Fuzzy matching]
Information Retrieval






Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about “data mining”.
Information Retrieval



Similarity: measure of how close a query is to a document.
Documents which are “close enough” are retrieved.
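One way to read "close enough" is as a threshold on a similarity score. The sketch below uses Jaccard similarity over term sets, which is an assumed choice (the slides do not name a specific measure); the threshold of 0.2 and the function names are likewise hypothetical.

```python
def jaccard_similarity(query, document):
    """Share of distinct terms that query and document have in common."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    if not q_terms or not d_terms:
        return 0.0
    return len(q_terms & d_terms) / len(q_terms | d_terms)


def retrieve(query, docs, threshold=0.2):
    """Return ids of docs whose similarity to the query meets the threshold."""
    return [doc_id for doc_id, text in docs.items()
            if jaccard_similarity(query, text) >= threshold]
```

Raising the threshold retrieves fewer documents that are closer to the query; lowering it retrieves more documents, including weaker matches.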
Metrics:
Precision = |Relevant and Retrieved| / |Retrieved|
Recall = |Relevant and Retrieved| / |Relevant|
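Both metrics follow directly from set operations on the retrieved and relevant document ids. The function below is a small sketch; its name is hypothetical, and returning 0.0 when a denominator is empty is an assumed convention rather than part of the definitions.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of doc ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    # |Relevant and Retrieved|
    hits = len(retrieved & relevant)
    # Assumed convention: report 0.0 when the denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if four documents are retrieved, three are relevant overall, and two documents are in both sets, precision is 2/4 and recall is 2/3.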
IR Query Result Measures