Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Search and Retrieval: Query Languages Prof. Marti Hearst SIMS 202, Lecture 19 Last Time Finding Out About Intro to Standard Information Retrieval Intro to Boolean Queries Marti A. Hearst SIMS 202, Fall 1997 Finding Out About Three phases: Asking of a question Construction of an answer Assessment of the answer Part of an iterative process Marti A. Hearst SIMS 202, Fall 1997 Finding Out About is an Iterative Process Repositories Goals Workspace Marti A. Hearst SIMS 202, Fall 1997 Information Retrieval: A Restricted Form of FOA The system has available only pre-existing, “canned” text passages. Its response is limited to selecting from these passages and presenting them to the user. It must select, say, 10 or 20 passages out of millions or billions! Marti A. Hearst SIMS 202, Fall 1997 Query Languages Express the user’s information need Components: query language program to interpret the language collection to compare the interpreted query against Marti A. Hearst SIMS 202, Fall 1997 Types of Query Languages Boolean Natural language (free style) Hybrid structured and free text Form-based SQL (for database queries) Marti A. Hearst SIMS 202, Fall 1997 Today More on Boolean Queries Database Queries IR vs. Database Queries Marti A. Hearst SIMS 202, Fall 1997 Basic Boolean Queries Components: terms (operands) connectors (operators) AND OR NOT Marti A. Hearst SIMS 202, Fall 1997 Boolean Queries Cat Cat OR Dog Cat AND Dog (Cat AND Dog) (Cat AND Dog) OR Collar (Cat AND Dog) OR (Collar AND Leash) (Cat OR Dog) AND (Collar OR Leash) Marti A. Hearst SIMS 202, Fall 1997 Boolean Queries Usually expressed as INFIX operators in IR NOT is UNARY PREFIX operator ((a AND b) OR (c AND (NOT b))) AND and OR can be n-ary operators ((a AND b) OR (c AND b)) (a AND b AND c AND d) Some rules NOT(a) AND NOT(b) = NOT(a OR b) NOT(a) OR NOT(b)= NOT(a AND b) NOT(NOT(a)) = a Marti A. Hearst SIMS 202, Fall 1997 Information need Collections Pre-process text input Parse Query Index Rank Result Sets Run a query, get a result set Two choices Reformulate query, run on entire collection Reformulate query, run on result set Example: Dialog query (Redford AND Newman) -> S1 1450 documents (S1 AND Sundance) ->S2 898 documents Marti A. Hearst SIMS 202, Fall 1997 Information need Collections Pre-process text input Parse Query Rank Index Reformulated Query Re-Rank Ordering of Retrieved Documents Pure Boolean has no ordering In practice: order chronologically order by total number of “hits” on query terms What if one term has more hits than others? Is it better to one of each term or many of one term? Fancier methods have been investigated p-norm is most famous usually impractical to implement usually hard for user to understand Marti A. Hearst SIMS 202, Fall 1997 Boolean Advantages simple queries are easy to understand relatively easy to implement Disadvantages difficult to specify what is wanted too much returned, or too little ordering not well determined Dominant language in commercial systems until the WWW Marti A. Hearst SIMS 202, Fall 1997 Faceted Boolean Query Strategy: break query into facets (polysemous with earlier meaning of facets) conjunction of disjunctions a1 OR a2 OR a3 b1 OR b2 c1 OR c2 OR c3 OR c4 AND each facet expresses a topic “rain forest” OR jungle OR amazon medicine OR remedy OR cure Smith OR Zhou Marti A. Hearst SIMS 202, Fall 1997 AND Faceted Boolean Query Query still fails if one facet missing Alternative: Coordination level ranking Order results in terms of how many facets (disjuncts) are satisfied Also called Quorum ranking, Overlap ranking, and Best Match Problem: Facets still undifferentiated Alternative: assign weights to facets Marti A. Hearst SIMS 202, Fall 1997 Proximity Searches Proximity: terms occur within K positions of one another A “Near” function can be more vague near(pen, paper) Sometimes order can be specified Also, Phrases and Collocations pen w/5 paper “United Nations” “Bill Clinton” Phrase Variants “retrieval of information” “information retrieval” Marti A. Hearst SIMS 202, Fall 1997 Filters Filters: Reduce set of candidate docs Often specified simultaneous with query Usually restrictions on metadata restrict by: date range internet domain (.edu .com .berkeley.edu) author size limit number of documents returned Marti A. Hearst SIMS 202, Fall 1997 SQL: Database Query Language Somewhat like Boolean Geared towards relational model A typical combination: SELECT X FROM Y WHERE Z Marti A. Hearst SIMS 202, Fall 1997 SQL basics SELECT: lists columns of tables FROM: which tables to use WHERE: which rows to include Example: get customer 1234’s bill: SELECT fee FROM accounts WHERE customer_number = 1234 Marti A. Hearst SIMS 202, Fall 1997 Simple SQL Join Given two relations: accounts (customer name and fee) customers (customer name and address) Retrieve all customers names, addresses, and fees: SELECT customer_num, address, fee FROM accounts, customers WHERE customers.customer_num = accounts.customer_num See reader for more details Marti A. Hearst SIMS 202, Fall 1997 IR vs. Database Systems (modified from van Rijsbergen) IR Database System Provides: User’s query: Pointer to data Data item General Specific Retrieval method: Probabilistic, Approximate Deterministic, correct Criteria for success: Utility Efficiency Marti A. Hearst SIMS 202, Fall 1997 Comparing Information Retrieval to Database Retrieval Query: How many SCSI drives did Compaq buy last year? Database: A natural query on a purchasing relation. Text Collection: More difficult to search for Less likely to find a reliable answer Interpretation needed on results. Marti A. Hearst SIMS 202, Fall 1997 Comparing Information Retrieval to Database Retrieval Query: What is the best SCSI disk drive to buy? Database: Complex query combining information from many relations. Probably can’t be written. Text Collection: Find usenet articles with people’s opinions Interpret and use text fragments that may have been written for other purposes. Marti A. Hearst SIMS 202, Fall 1997 Next Time: Assessing the Answer How well do the results answer the question? How relevant are they to the user? Marti A. Hearst SIMS 202, Fall 1997 Assignment -- Text Search! Get a Lexis-Nexis account from Roberta Use it in the lab or at home Many collections are accessible News Public Interest Legal Get to know the interface Do some searches Answer some questions Marti A. Hearst SIMS 202, Fall 1997