Download No Slide Title

Applying Evolutionary Computation Techniques to Web Information Retrieval Chih-Chin Lai, Ph.D. Dept. of Computer Science and Information Engineering National University of Tainan, Taiwan E-mail: [email protected] Nov. 28, 2007 Outlines • Information Retrieval – some related topics • Evolutionary Computation (EC) • Applying EC to Web Information Retrieval • Conclusions 2 Introduction • Definition of Information Retrieval – Salton (1989): Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests. – Kowalski (1997): An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects). 3 Introduction (cont.) • Information Retrieval (IR) – The indexing and retrieval of textual documents • Searching newspaper articles • Searching on the Web – Concerned firstly with retrieving relevant documents to a query – Concerned secondly with retrieving large sets of documents efficiently 4 Typical IR Task • Given – – – • User has information need A corpus of textual natural-language documents A user query in the form of a textual string Find – A ranked set of documents that are relevant to the query 5 Key Qualities • • • Document and query representations Mechanisms for finding relevant documents and ranking the results Mechanisms for obtaining user feedback 6 Typical IR System User Process Process Retrieved relevant(?) documents Store Retrieval Part 7 IR System Relevance • Relevance is a subject judgment – Being on the proper subject – Being timely (recent information) – Satisfying the goals of the user and his/her intended use of the information (information need) 8 IR System Components • Text operations forms index words (tokens) – Stopword removal – Stemming • Indexing maps each keyword to a set of documents that contains the keyword • Searching retrieves documents that contain a given query token from the inverted index • Ranking scores all retrieved documents according to a relevance metric 9 IR System Components (cont.) • User interface manages interaction with the user – Query input and document output – Relevance feedback – Visualization of results • Query operations transform the query to improve retrieval 10 Examples of IR System • Conventional (library catalog): Search by keyword, title, author, etc. 11 Examples of IR System (cont.) • Text-based (Google): Search by keywords. Limited search using queries in natural language 12 Examples of IR System (cont.) • Multimedia (WebSeek): Search by visual appearance (shapes, colors,…) 13 Examples of IR System (cont.) • Question answering systems (AnswerBus): Search in (restricted) natural language 14 Searching the Web • Application of IR to HTML documents on the World Wide Web • Three forms – Use search engines that index a portion of the Web documents as a full-text database – Use Web directories, which classify selected Web documents by subject – Search the Web exploiting its hyperlink structure 15 Web Search System Documents Spider User Process Process Retrieved relevant(?) documents Store World Wide Web Retrieval Part 16 IR System Retrieval Models • A retrieval model specifies the details of: – Document and Query representation – Matching strategies for assessing the relevance of documents to a user query – Methods for ranking query output – Mechanisms for acquiring user-relevance feedback • Notion of relevance can be binary or continuous (i.e. ranked retrieval) 17 Types of IR Models • Boolean model – Simple Boolean queries regarding existence of terms within documents – Easy to understand, but difficult to rank output • Vector space model – Documents are represented by n-dimensional vectors – Typically one dimension per term 18 Types of IR Models (cont.) • Probabilistic model – Start with some user-supplied relevance information about a “training set” of documents – The training set is used to compute term weights by estimating P(t in document | document is relevant) P(t in document | document is irrelevant ) – Useful for theoretical analysis, but probably not in practice (?) 19 Statistical Retrieval • Retrieval based on similarity between query and documents • Output documents are ranked according to similarity to query • Similarity based on occurrence frequencies of keywords in query and document 20 The Vector Space Model • A document is typically represented by a bag of words (unordered words with frequencies) • Assume a vocabulary of t distinct terms • Each term, i, in a document or query, j, is given a real-valued weight, wij • Both documents and queries are expressed as t-dimensional vectors dj = (w1j, w2j, …, wtj) 21 Concept Representation Example: T3 Vdoc1 = 2T1 + 4T2 + 5T3 Vdoc2 = 4T1 + 7T2 + T3 5 Vquery = 0T1 + 0T2 + 2T3 Vdoc1 = 2T1+ 4T2 + 5T3 Vquery = 0T1 + 0T2 + 2T3 2 4 T1 Vdoc2 = 4T1 + 7T2 + T3 T2 7 • Is Vdoc1or Vdoc2 more similar to Vquery? • How to measure the degree of similarity? 22 Term Weights: TF-IDF • More frequent terms in a document are more indicative to the topic fij = frequency of term i in document j tfij = fij / max{fij} (normalization) • Terms that appear in many different documents are less indicative of overall topic df i = document frequency of term i = number of documents containing term i idfi = inverse document frequency of term i, = log(N/ df i) ( where N: total number of documents) 23 TF-IDF Weighting • A typical combined term importance indicator is tf-idf weighting wij = tfij idfi = tfij log (N/ dfi) • A term occurring frequently in the document but rarely in the rest of the collection is given high weight • Experimentally, tf-idf has been found to work well 24 Similarity Measure • A similarity measure is a function that computes the degree of similarity between two vectors • Using a similarity measure between the query and each document – to rank the retrieved documents – to control the size of the retrieved set 25 Similarity Measure (cont.) t3 • Cosine similarity measures the cosine of the angle between two vectors inner product normalized by the vector t lengths   dj q   (wij  wiq )   dj  q  wij   wiq  CosSim(dj, q) = 1 Vdoc1 2 i 1 t i 1 2 t i 1 Vquery t1 2 t2 Vdoc2 Vdoc1 = 2T1 + 4T2 + 5T3 CosSim(Vdoc1 , Vquery) = 10 / (4+16+25)(0+0+4) = 0.75 Vdoc2 = 4T1 + 7T2 + 1T3 CosSim(Vdoc2 , Vquery) = 2 / (16+49+1)(0+0+4) = 0.12 Vquery = 0T1 + 0T2 + 2T3 D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product. 26 Accuracy Measures: Precision and Recall not retrieved Relevant retrieved & not retrieved but documents relevant relevant retrieved & irrelevant Not retrieved & irrelevant irrelevant relevant irrelevant relevant retrieved retrieved not retrieved retrieved & relevant not retrieved but relevant retrieved & irrelevant Not retrieved & irrelevant From all the documents that are retrieved by the IR system, how many are relevant? precision  Number of relevant documents retrieved Total number of retrieved documents From all the documents that are relevant out there, how many did the IR system retrieve? recall  Number of relevant documents retrieved Total number of relevant documents 27 Precision and Recall • Precision – The ability to retrieve top-ranked documents that are mostly relevant • Recall – The ability of the search to find all of the relevant items in the corpus 28 Precision and Recall Variations Narrow query formulation: Returns relevant documents but misses many useful ones The ideal case Precision 1 Broad query formulation: Returns most relevant documents but includes lots of junk 1 0 Recall Figure taken from: Raymond J. Mooney (http://www.cs.utexas.edu/users/mooney/ir-course/) 29 Evolutionary Computation • Definition – EC (GA, GP, ES) solve computational problems by simulating evolution with natural selection – They are stochastic search algorithms which incrementally preserve and combine desirable features of individual potential solutions in a population over an extended period of time Figure taken from: www.genetic-programming.org 30 Template of EC procedure EC begin t := 0; initializePopulation(P(0)); evaluate(P(0)); repeat t := t + 1; P' = selectForVariation((P(t)); recombine(P'); mutate(P'); evaluate(P'); until termination = true; end 31 Applications of EC to IR • EC has been applied to the following problems – – – – – – – – Automatic document indexing Document and term clustering Query definition Matching function learning Image retrieval Design of user profiles for IR on the Internet Web page classification Design of agents for Internet searching 32 MGA for Web Search • Genetic algorithm – John Holland, 1975 – David E. Goldberg, 1989 • Metagenetic algorithm (MGA) – Zacharis and Panayiotopoulos proposed (2001) – A two-stage GA that controls and optimizes both populations simultaneously 33 MGA for Web Search (cont.) • Zacharis and Panayiotopoulos, [IEEE Internet Computing, 2001] 34 Hierarchical Genetic Algorithm • HGA – Tang et al. (1998) proposed – It is a variant of conventional genetic algorithm with hierarchical genetic structure – In HGA, the chromosome consists of two types of genes • the control genes and • the parametric genes • The relationship between parametric genes and control genes is that the activation of former is governed by the value of the latter 35 HGA Representation control genes parametric genes      [1 0 1 1 0 :: 53.2 19.6 34.7 68.2 75.3] chromosome i represents parameters (53.2, 34.7, 68.2) (a) 1st [ level control } 1 0 2nd genes :: level control genes parametric genes               0 1 0 0 1 1 :: 33.2 78.5 46.8 22.1 94.6 55.4 ]  0  control  33 .2  control control     1 1   78 .5   control  0  46 .8 chromosome  0  control  22 .1  control control     0 1   94 .6   control 1  55 .4 j represents a parameter 78.5 (b) 36 HGA for Web Search Chromosome Dictionary W1 > W2 select Keyword1 Randomly generated Keyword1 Keyword2 Control genes 1 0 1 1 0 1 0 0 Parametric genes 1 | news | intelligence | mit | lab | artificial | 1 | mit | artificial | ai | lab | intelligence | 37 HGA for Web Search (cont.) Control genes 1 0 1 1 0 1 0 0 1 1 Cut point Control genes 1 1 1 0 1 0 1 0 0 1 Parametric genes | news | intelligence | mit | lab | artificial | | mit | artificial | ai | lab | intelligence | 38 HGA for Web Search (cont.) Control genes Parametric genes 1 1 1 0 1 | news | intelligence | mit | lab | artificial | Dictionary 0 1 0 0 1 | mit | artificial | ai | lab | intelligence | 39 HGA for Web Search (cont.) Interesting User interface Update PWIS Vector DD Relevance page Recommendation component Query Vector DR World Wide Web Keywords Results by PageRank 40 HGA for Web Search (cont.) 1 0.9 0.8 0.7 適應值 fitness 0.6 0.5 0.4 0.3 0.2 WRA-Keyword HGA-Keyword WRA-Non-Keyword HGA-Non-Keyword MGA GA 0.1 0 1 70 139 208 277 346 415 484 553 622 691 760 829 898 967 1036 1105 1174 1243 1312 1381 1450 1519 染色體 # of chromosomes 41 HGA for Web Search (cont.) Methods Fitness PR Stability Time Score Rank HGA-Keyword WRA -UserKeyword 2 1 1 1 5 1 HGA-Non-Keyword 4 2 3 1 10 2 MGA 1 4 2 4 11 3 GA 3 3 4 3 13 4 42 Profile for Web Search 43 Profile for Web Search 44 Profile for Web Search 45 Profile for Web Search (cont.) 46 Profile for Web Search (cont.) 47 Profile for Web Search (cont.) 48 Conclusions • The aim of a Web IR system is to estimate the relevance of web information items to a user information need expressed in a query – This is a very hard and complex task – It is pervaded with subjectivity, vagueness and impression • The main characteristic of EC is that it is tolerant to impression, vagueness, partial truth, and approximation – EC techniques have been used satisfactorily to improve IR process 49 Conclusions (cont.) Figure taken from: M. Henzinger, “The past, presence, and future of Web Information Retrieval” 50 Web Intelligence • Today's search engines are designed for human consumption: (1) A user queries the SE and gets relevant pages (2) The user reads the pages and extracts manually the information (3) The information must be integrated to produce the desired knowledge (1) (1) (2) (3) (3) 51 Figure taken from: Prof. F. Ciravegna, University of Sheffield, “Web Intelligence” Web Intelligence (cont.) • The future web will have semantics associated to pages and SE will be able to provide semantically-based services Figure taken from: Prof. F. Ciravegna, University of Sheffield, “Web Intelligence” 52 References: Journals • Information Processing and Management • Journal of the American Society of Information Science • Transactions On Information Science • Information Retrieval • Journal of Documentation • Information Retrieval 53 Good books • Van Rijsbergen – “Information Retrieval”, ir.dcs.gla.ac.uk • Sparck Jones and Willett – “Readings in Information Retrieval” • Baeza-Yates and Ribeiro-Neto – “Modern Information Retrieval” • Witten, Moffat and Bell – “Managing Gigabytes” 54

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download No Slide Title