Download Chapter 1 - WordPress.com

Chapter 2: Information Retrieval
Chapter 2 in the textbook
Sections: 2.1, 2.2 (2.2.1, 2.2.2), 2.3 (2.3.1, 2.3.2, 2.3.3), 2.4 (2.4.1, 2.4.2)

Modern Information Retrieval
• Document representation
  • Using keywords
  • Relative weight of keywords
• Query representation
  • Keywords
  • Relative importance of keywords
• Retrieval model
  • Similarity between document and query
  • Rank the documents
• Performance evaluation of the retrieval process

Document Representation
Transforming a text document to a weighted list of keywords

Stopwords
Figure 2.2: A partial list of stopwords

Sample Document
Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad-hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.
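The first step of document representation, dropping stopwords and counting the remaining keywords, can be sketched as follows. This is a minimal illustration, not the textbook's procedure: the `STOPWORDS` set is a small hypothetical sample rather than the list in Figure 2.2, and the regular-expression tokenizer and the function name `keyword_frequencies` are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample of stopwords; Figure 2.2 lists many more.
STOPWORDS = {"a", "an", "and", "as", "for", "has", "in", "is", "of", "that", "the", "to"}

def keyword_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens,
    drop stopwords, and count what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# One sentence from the sample document.
sentence = ("Data mining refers to a family of techniques used to detect "
            "interesting nuggets of relationships/knowledge in data.")
freqs = keyword_frequencies(sentence)
print(freqs.most_common(1))  # [('data', 2)]
```

Stemming would normally follow this step, so that forms such as "techniques" and "technique" are counted under one keyword.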
List of words in d1 after deleting stopwords

Stemming
A given word may occur in a variety of syntactic forms:
• plurals
• past tense
• gerund forms (a noun derived from a verb)
Example: the word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.

Stemming
A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.
• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection, postconnection
• Stem: connect

Porter's Algorithm
• Letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant, otherwise it is a consonant
  • For example, Y in synopsis is a vowel, while in toy it is a consonant
• A consonant in the algorithm description is denoted by c, and a vowel by v

Porter's Algorithm
• m is the measure of vc repetition: a word has the form [C](VC)^m[V], where C and V are runs of consonants and vowels
  • m=0: TR, EE, TREE, Y, BY
  • m=1: TROUBLE, OATS, TREES, IVY
  • m=2: TROUBLES, PRIVATE, OATEN, ORRERY
• *S: the stem ends with S (similarly for other letters)
• *v*: the stem contains a vowel
• *d: the stem ends with a double consonant (e.g., -TT)
• *o: the stem ends cvc, where the second c is not W, X, or Y (e.g., -WIL)

Porter's Algorithm: Steps 1 to 5
• Step 1: plurals and past participles
• Steps 2–4: straightforward stripping of suffixes
• Step 5: tidying up

Example
GENERALIZATIONS
• Step 1: GENERALIZATION
• Step 2: GENERALIZE
• Step 3: GENERAL
• Step 4: GENER
OSCILLATORS
• Step 1: OSCILLATOR
• Step 2: OSCILLATE
• Step 4: OSCILL
• Step 5: OSCIL

Porter's Algorithm
Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)

Document Representation

Term-Document Matrix
• The term-document matrix (TDM) is a two-dimensional representation of a document collection
• Rows of the matrix represent the documents
• Columns correspond to the index terms
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row)
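The term-document matrix described above can be built in a few lines. The sketch below assumes each document has already been reduced to a list of stemmed keywords; the three toy documents and the helper name `build_tdm` are hypothetical and are not taken from the textbook's collection.

```python
from collections import Counter

def build_tdm(docs):
    """Return (terms, matrix): one row per document, one column per index term.
    Each cell holds the raw frequency of the term in the document."""
    terms = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in terms])
    return terms, matrix

# Hypothetical stemmed keyword lists, one per document.
docs = [
    ["data", "mine", "data", "databas"],
    ["databas", "system"],
    ["mine", "algorithm", "data"],
]
terms, tdm = build_tdm(docs)
print(terms)  # ['algorithm', 'data', 'databas', 'mine', 'system']
for row in tdm:
    print(row)
```

Most cells in a realistic matrix are zero, which is why sparse representations such as triples or pairs are worth considering.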
20 Term-Document matrix 21 Sparse Matrixes- triples 22 Sparse Matrixes- Pairs 23 Normalization • raw frequency values are not useful for a retrieval model • prefer normalized weights, usually between 0 and 1, for each term in a document • dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization: 24 Normalized Term-Document Matrix 25 Vector Representation of document d1 (word, frequency, normalized frequency) 26 Retrieval models Retrieval models match query with documents to:  separate documents into relevant and non-relevant class  rank the documents according to the relevance 27 Retrieval models Boolean model Vector space model (VSM) Probabilistic models 28 Boolean Retrieval Model 29 Boolean Retrieval Model      One of the simplest and most efficient retrieval mechanisms Based on set theory and Boolean algebra Conventional numeric representations of false as 0 and true as 1 Boolean model is interested only in the presence or absence of a term in a document In the term-document matrix replace all the nonzero values with 1 30 Boolean Term-document Matrix 31 Example Document set  DocSet(K0) = {D1,D3,D5}  DocSet(K4)={D2,D3,D4,D6} Query  K0 and K4  K0 or K4 32 K0 or (not K3 and K5) 33 Boolean Query  User Boolean queries are usually simple Boolean expressions  A Boolean query can be represented in a “disjunctive normal form” (DNF)    disjunction corresponds to or conjunction refers to and DNF consists of a disjunction of conjunctive Boolean expressions 34 DNF form  K0 or (not K3 and K5) is in DNF  DNF query processing can be very efficient  If any one of the conjunctive expressions is true, the entire DNF will be true  Short-circuit the expression evaluation  Stop matching the expression with a document as soon as a conjunctive expression matches the document; label the document as relevant to the query 35 Boolean Model Advantages  Simplicity  Binary and efficiency of implementation 
values can be stored using bits  reduced storage requirements  retrieval using bitwise operations is efficient  Boolean retrieval was adopted by many commercial bibliographic systems  Boolean queries queries are akin to database 36 Boolean Model Disadvantages A document is either relevant or non-relevant to the query  It is not possible to assign a degree of relevance  Complicated Boolean queries are difficult for users  Boolean queries retrieve too few or too many documents.   K0 and K4 retrieved only 1 out of 6 documents K0 or K4 retrieved 5 out of a possible 6 documents 37 Vector Space Model (VSM) 38 Vector Space Model  Treats both the documents and queries as vectors A weight based on the frequency in the document: 39 Graphical representation of the VSM Model 40 41 Computing the similarity 42 Relevance Values and Ranking Ranking D0 (0.7774) D6 (0.4953) D2 (0.3123) D1 (0.2590) D5 (0.2122) D4 (0.1727) D3 (0.1084) 43 Variations of VSM  Variations of the normalized frequency  Inverse document frequency (idf)  N = no. of documents  nj = no. of documents containing jth term  Modified weights : 44 Inverse Document Frequencies for Collection (normalized) 7 idf 0  idf1  idf 2  idf3  log  0.368 3 7 idf 4  idf 5  idf 6  log  0.243 4 45 TDM using idf 46 q  (0,0.2,0.6,0,0.2,0.3,0) Ranking D0 (0.7867) D6 (0.4953) D2 (0.3361) D1 (0.2590) D5 (0.2215) D4 (0.1208) D3 (0.0969) 47 VSM vs. Boolean  Queries are easier to express: allow users to attach relative weights to terms A descriptive query can be transformed to a query vector similar to documents  Matching between a query and a document is not precise: document is allocated a degree of similarity  Documents are ranked based on their similarity scores instead of relevant/non-relevant classes  Users can go through the ranked list until their information needs are met. 
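Ranking in the vector space model can be sketched as below. The slides' similarity formula is not reproduced in this text, so the sketch assumes cosine similarity, a common choice for the VSM; it may not reproduce the relevance values listed above. The `idf` helper uses a base-10 logarithm, which matches the values 0.368 and 0.243 given for the collection. The document vectors `Da`, `Db`, and `Dc` are hypothetical stand-ins, not rows of the textbook's matrix.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(n_docs / doc_freq)

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = {name: cosine(query, vec) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)  # query vector from the slides

# Hypothetical document vectors over the same seven terms.
docs = {
    "Da": (0, 0.2, 0.6, 0, 0.2, 0.3, 0),
    "Db": (1, 0, 0, 0, 0, 0, 0),
    "Dc": (0, 0, 1, 0, 0, 0, 0),
}
ranking = rank(q, docs)
print([name for name, _ in ranking])        # ['Da', 'Dc', 'Db']
print(round(idf(7, 3), 3), round(idf(7, 4), 3))  # 0.368 0.243
```

A document identical to the query scores 1, and a document sharing no weighted term with the query scores 0, which is what gives the model its graded notion of relevance.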
Evaluation of Retrieval Performance
Evaluation should include:
• Functionality
• Response time
• Storage requirement
• Accuracy

Accuracy Testing
Early days:
• Batch testing
  • Document collection such as cacm.all
  • Query collection such as query.text
Present day: interactive tests are used
• Difficult to conduct and time consuming
• Batch testing is still important

Precision and Recall
• Precision: how many of the retrieved documents are relevant?
• Recall: how many of the relevant documents are retrieved?

Our earlier example illustrating the VSM
• Documents from Fig. 2.15
• Query q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)
Ranking (semantic analysis: documents marked with an asterisk are relevant)
1. D0*
2. D6
3. D2*
4. D1
5. D5*
6. D4
7. D3*
Retrieved the three top-ranked documents
• Relevant documents: R = {D0, D2, D5, D3}
• Retrieved documents: A = {D0, D6, D2}
• R ∩ A = {D0, D2}
$precision = \frac{|R \cap A|}{|A|} = \frac{|\{D0, D2\}|}{|\{D0, D6, D2\}|} = \frac{2}{3} = 0.67$
$recall = \frac{|R \cap A|}{|R|} = \frac{|\{D0, D2\}|}{|\{D0, D2, D5, D3\}|} = \frac{2}{4} = 0.5$

F-measure
$F = \frac{precision \times recall}{(precision + recall)/2} = \frac{2 \times precision \times recall}{precision + recall}$
$F = \frac{2 \times 0.67 \times 0.5}{0.67 + 0.5} = \frac{0.67}{1.17} = 0.57$

Average Precision
The choice of three retrieved documents was arbitrary.
Rank retrieved   1     2     3     4     5     6     7
Precision        1.00  0.50  0.67  0.50  0.60  0.50  0.57
Recall           0.25  0.25  0.50  0.50  0.75  0.75  1.00

Relationship between precision and recall

Average Precision
$Average\ Precision = \frac{\sum_{i=1}^{N} precision(i) \times relevance(i)}{|R|}$
$Average\ Precision = \frac{1.00 \times 1 + 0.50 \times 0 + 0.67 \times 1 + 0.50 \times 0 + 0.60 \times 1 + 0.50 \times 0 + 0.57 \times 1}{4} = \frac{2.84}{4} = 0.71$
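The evaluation measures above can be checked with a short script. This is a minimal sketch using the ranking and the relevant set from the slides; the function names and the cutoff parameter `k` are assumptions made for the example.

```python
def precision_recall(ranking, relevant, k):
    """Precision and recall when the top k ranked documents are retrieved."""
    retrieved = set(ranking[:k])
    hits = len(retrieved & relevant)
    return hits / k, hits / len(relevant)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def average_precision(ranking, relevant):
    """Sum the precision at each rank that holds a relevant document,
    then divide by the number of relevant documents."""
    hits = 0
    total = 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

ranking = ["D0", "D6", "D2", "D1", "D5", "D4", "D3"]
relevant = {"D0", "D2", "D5", "D3"}

p, r = precision_recall(ranking, relevant, k=3)
f = f_measure(p, r)
ap = average_precision(ranking, relevant)
print(round(p, 2), round(r, 2), round(f, 2), round(ap, 2))  # 0.67 0.5 0.57 0.71
```

Unlike precision and recall at a fixed cutoff, average precision does not depend on how many documents the user chooses to retrieve.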
