CENG 776 Information Retrieval
Nihan Kesim Çiçekli
email: [email protected]
URL: http://www.ceng.metu.edu.tr/~nihan

Course Description
This course is about
• Processing
• Indexing
• Retrieving
• … textual data
Fits in four lines, but much more complex and interesting than that.

Course Resources
Course web page
• cow.ceng.metu.edu.tr
Text Book
• Introduction to Information Retrieval, by Christopher D. Manning et al., Cambridge University Press, 2008.
Reference Material
• Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
• Information Retrieval: Algorithms and Heuristics, by D. Grossman and O. Frieder.
• Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto.
• Search Engines: Information Retrieval in Practice, by B. Croft et al.

Grading
• Written assignments  20%
• Presentation         20%
• Project              30%
• Final Exam           30%

Course Content
• Boolean Retrieval
• Indexing and TF-IDF
• Components of an IR system
• Evaluation in information retrieval
• Retrieval Models: Boolean, Vector Space Model, Probabilistic Model
• Web search basics
• Web crawling, indexes and link analysis
• Text Classification
• Text clustering

The task of Information Retrieval (IR)
• A large repository of documents is stored on computers (corpus).
• There is a topic about which the user wants to get some information (information need).
• Some of those documents may contain the information that satisfies the user's need (relevance).
• How does the user retrieve those documents?
• The user communicates the information need to the computer by expressing it in the form of a query.

Structured data
• How the query is expressed will depend on whether the data in the document corpus is structured or unstructured.
• Structured data tends to refer to information in "tables" and has a clear semantic structure.
  Employee   Manager   Salary
  Smith      Jones     5000
  Hopkins    Smith     6000
  Black      Jones     5000

Structured data
• Structured data allows for expressive queries like: "List the social security numbers of all employees who have worked more than 5 years and whose salaries are above the average salary of all employees."

Unstructured data
• Unstructured data does not have a clear semantic structure (e.g. free text on a web page, audio, video).
• Allows less expressive queries of the form: "Give me all documents that have the keywords 'culture of cities'."

  Structured data   → Database systems
  Unstructured data → Information retrieval

Information Need
• Example of an information need in the context of the World Wide Web: "Find all documents (information!) about universities in Turkey that (1) offer master's degrees in Computer Engineering and (2) are registered with ACM SIGIR. The information (the document!) should include full curriculum, student campus, e-mail and other contact details."
• Formal representation of an information need = Query

Information Retrieval / Data Retrieval
                       Information Retrieval   Data Retrieval
  Matching             vague                   exact
  Model                probabilistic           deterministic
  Query language       natural                 artificial
  Query specification  incomplete              complete
  Items wanted         relevant                all (matching)
  Error handling       insensitive             sensitive

Semi-structured data
• In reality, almost no data is truly unstructured.
• Even this slide has identified zones such as title and bullets.
• Headings, paragraphs and footnotes are represented by explicit markup.
• Semi-structured data facilitates queries such as: "List all documents in which Title contains data and Bullets contain markup."

Definition (IR)
• Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) from a corpus, that satisfies an information need.

What Types of Information?
• Text (Documents)
• XML and structured documents
• Images
• Audio (sound effects, songs, etc.)
• Video
• Source code
• Applications/Web services

The Information Retrieval Cycle
(Diagram: source selection → query formulation → search → selection from the ranked list of documents, with query reformulation and relevance feedback looping back to the query.)

Information Retrieval
• Representation, storage, organisation and access of information (information items, information objects, documents).
• Find relevant (useful) information.
• Goal of an IR system – RECALL: retrieve all relevant documents (e.g. law documents).
• Goal of an IR system – PRECISION: retrieve the most relevant documents (e.g. the web).
• Goal of an IR system:
  – Retrieve as few non-relevant documents as possible.
  – Retrieve relevant documents before non-relevant documents.

The IR Black Box
(Diagram: a query and a collection of documents go into the black box; results come out.)

Inside The IR Black Box
(Diagram: the query becomes a query representation and the documents become document representations stored in an index; a comparison function matches the two to produce results.)

The Central Problem in IR
• The information seeker expresses concepts as query terms; authors express concepts as document terms.
• Do these represent the same concepts?
• Why is IR hard? Because language is hard!

Relevance
• Relevance is a subjective judgment and may include:
  – Being on the proper subject.
  – Being timely (recent information).
  – Being authoritative (from a trusted source).
  – Satisfying the goals of the user and his/her intended use of the information (information need).

Keyword Search
• The simplest notion of relevance is that the query string appears verbatim in the document.
• A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  – "restaurant" vs. "café"
  – "PRC" vs. "China"
• May retrieve irrelevant documents that include ambiguous terms.
  – "bat" (baseball vs. mammal)
  – "Apple" (company vs. fruit)
  – "bit" (unit of data vs. act of eating)

Beyond Keywords
• We will cover the basics of keyword-based IR.
• We will focus on extensions and recent developments that go beyond keywords.
• We will cover the basics of building an efficient IR system.
• We will focus on basic capabilities and algorithms rather than systems issues that allow scaling to industrial-size databases.

Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.

IR System Architecture
(Diagram: the user interface passes the user's text and query through text operations to a logical view; query operations, searching and ranking use the index built by indexing; the database manager maintains the inverted file over the text database; ranked documents are returned and user feedback flows back.)

IR System Components
• Text Operations forms index words (tokens).
  – Stopword removal
  – Stemming
• Indexing constructs an inverted index of word-to-document pointers.
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.

IR System Components (continued)
• User Interface manages interaction with the user:
  – Query input and document output.
  – Relevance feedback.
  – Visualization of results.
• Query Operations transform the query to improve retrieval:
  – Query expansion using a thesaurus.
  – Query transformation using relevance feedback.

What to Evaluate?
• Coverage of information
• Form of presentation
• Effort required / ease of use
• Time and space efficiency
• Recall: proportion of relevant material actually retrieved
• Precision: proportion of retrieved material actually relevant

Precision vs. Recall
  Recall    = |RelRetrieved| / |Relevant in Collection|
  Precision = |RelRetrieved| / |Retrieved|
(Diagram: within the set of all documents, the retrieved set and the relevant set overlap; RelRetrieved is their intersection.)

Beyond Just Search
• Text Categorization
• Document/Term Clustering
• Text Summarization

Web Search
• Application of IR to HTML documents on the World Wide Web.
• Differences:
  – Must assemble the document corpus by spidering the web.
  – Can exploit the structural layout information in HTML (XML).
  – Documents change uncontrollably.
  – Can exploit the link structure of the web.

IR System
(Diagram: a document corpus and a query string go into the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, …)

Web Search System
(Diagram: a spider gathers the document corpus from the web; the IR system then answers query strings with ranked documents: 1. Page1, 2. Page2, 3. Page3, …)

Other IR-Related Tasks
• Automated document categorization
• Information filtering (spam filtering)
• Automated document clustering
• Recommending information or products
• Information extraction
• Information integration
• Question answering

Information Retrieval vs Information Extraction
• Information Retrieval: given a set of query terms and a set of document terms, select only the most relevant documents (precision), and preferably all the relevant ones (recall).
• Information Extraction: extract from the text what the document means.
• IR can FIND documents but need not "understand" them.

History of IR
• 1960-70's:
  – Initial exploration of text retrieval systems for "small" corpora of scientific abstracts, and law and business documents.
  – Development of the basic Boolean and vector-space models of retrieval.
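The pipeline of components described earlier (text operations, inverted-index construction, searching) and the precision/recall measures can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular system; all names (STOPWORDS, build_index, search, precision_recall) and the tiny corpus are made up for the example.

```python
import re
from collections import defaultdict

# Illustrative stopword list; real systems use larger ones.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to"}

def tokenize(text):
    """Text operations: lowercase, split into words, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

def build_index(docs):
    """Indexing: map each term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: boolean AND over the query terms' posting sets."""
    postings = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()

def precision_recall(retrieved, relevant):
    """Precision = |RelRetrieved|/|Retrieved|, Recall = |RelRetrieved|/|Relevant|."""
    rel_retrieved = retrieved & relevant
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

docs = {
    1: "the culture of cities",
    2: "city culture and urban life",
    3: "baseball bats and gloves",
}
index = build_index(docs)
retrieved = search(index, "culture cities")
print(retrieved)                                 # {1}
print(precision_recall(retrieved, {1, 2}))       # (1.0, 0.5)
```

Note how the toy run also exhibits the keyword problems discussed above: document 2 is relevant but missed because "city" does not match "cities" without stemming, so recall is only 0.5.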
IR History (continued)
• 1980's:
  – Large document database systems, many run by companies:
    • Lexis-Nexis
    • Dialog
    • MEDLINE

IR History (continued)
• 1990's:
  – Searching FTPable documents on the Internet
    • Archie
    • WAIS
  – Searching the World Wide Web
    • Lycos
    • Yahoo
    • Altavista

IR History (continued)
• 1990's continued:
  – Organized Competitions
    • NIST TREC
  – Recommender Systems
    • Ringo
    • Amazon
    • NetPerceptions
  – Automated Text Categorization & Clustering

Recent IR History
• 2000's:
  – Link analysis for Web Search
    • Google
  – Automated Information Extraction
    • Whizbang
    • Fetch
    • Burning Glass
  – Question Answering
    • TREC Q/A track

Recent IR History
• 2000's continued:
  – Multimedia IR
    • Image
    • Video
    • Audio and music
  – Cross-Language IR
    • DARPA Tides
  – Document Summarization

Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning

Database Management
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML) brings it closer to IR.

Library and Information Science
• Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization).
• Concerned with effective categorization of human knowledge.
• Concerned with citation analysis and bibliometrics (structure of information).
• Recent work on digital libraries brings it closer to CS & IR.

Artificial Intelligence
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries:
  – First-order Predicate Logic
  – Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to IR.
Natural Language Processing
• Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse.
• The ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

Natural Language Processing: IR Directions
• Methods for determining the sense of an ambiguous word based on context (word sense disambiguation).
• Methods for identifying specific pieces of information in a document (information extraction).
• Methods for answering specific NL questions from document corpora.

Machine Learning
• Focused on the development of computational systems that improve their performance with experience.
• Automated classification of examples based on learning concepts from labeled training examples (supervised learning).
• Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

Machine Learning: IR Directions
• Text Categorization
  – Automatic hierarchical classification (Yahoo).
  – Adaptive filtering/routing/recommending.
  – Automated spam filtering.
• Text Clustering
  – Clustering of IR query results.
  – Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining

IR systems on the Web
• Search for Web pages: http://www.google.com
• Search for images: http://www.picsearch.com
• Search for answers to questions: http://www.askjeeves.com
• Music retrieval: http://www.rotorbrain.com/foote/musicr/

Open-source IR toolkits
• Smart (Cornell)
• MG (RMIT & Melbourne, Australia; Waikato, New Zealand)
• Lemur (CMU / Univ. of Massachusetts)
• Lucene (Nutch)

Information Retrieval Models
• The Boolean Model
• The Vector Space Model
• Probabilistic models
• Language models
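Of the retrieval models listed above, the Vector Space Model is the easiest to sketch: documents and the query become TF-IDF weight vectors, and documents are ranked by cosine similarity to the query. The code below is a toy illustration under simple assumptions (raw term frequency, log inverse document frequency, whitespace tokenization, a made-up three-document corpus); production systems such as Lucene use more refined weighting schemes.

```python
import math
from collections import Counter

docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "database systems answer exact structured queries",
    "d3": "web search is information retrieval on the web",
}

def tf_idf_vectors(texts):
    """Build a TF-IDF vector (term -> weight) for each document."""
    tokenized = {name: text.split() for name, text in texts.items()}
    n = len(tokenized)
    df = Counter()                       # document frequency per term
    for toks in tokenized.values():
        df.update(set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {name: {t: c * idf[t] for t, c in Counter(toks).items()}
               for name, toks in tokenized.items()}
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

vectors, idf = tf_idf_vectors(docs)
q_vec = {t: c * idf.get(t, 0.0)
         for t, c in Counter("information retrieval".split()).items()}
ranking = sorted(docs, key=lambda name: cosine(q_vec, vectors[name]),
                 reverse=True)
print(ranking)                           # ['d1', 'd3', 'd2']
```

The ranking matches intuition: d1 is a short document about the query terms, d3 mentions them amid other terms, and d2 shares no terms with the query at all.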