Indexing for Searching - UNC School of Information and Library

Major Issues

Information is mostly online.
Information is increasingly available in full text (full content).
There is an explosion in the amount of information being produced.

Manual Indexing

Humans read and index content.
Fairly good, although not consistent (interobserver, or even intraobserver).
Certain fields support costly manual indexing (the primary example is Medline).

Major Issues

This is a problem for all fields unable to afford manual indexing, and even for biomedicine: there is so much knowledge in the huge amount of literature being produced that we cannot keep track of it or utilize it.
Research example: Swanson’s undiscovered literature.

What this means

Need ways to index without requiring paid experts
  – Automatic indexing, classification, keyword extraction, and even relationship and fact extraction.
  – Need to take advantage of experts who are reading the materials to comment on them and provide rankings, summarizations, keywords, and “factoids” (like Amazon).

Why Automatic Classification?

Classification is time consuming and expensive.
Knowledge structuring
  – Too much information
Status of automatic classification
  – Approaching the level of human indexing (NLM’s MetaMap).

What is Automatic Classification?

Automatic manipulation of a document’s contents to support logical grouping with other similar documents for organization and/or retrieval activities.
Can include the assignment of, or manipulation of, classification notation.

Approaches and Methods

Initial approach
  – Create an inverted file
  – On-the-fly (natural language processing)
Methods
  – All words, remove stop words
  – Word frequencies (Wilson’s objective method of determining aboutness)
  – More sophisticated IR methods
      • Semantic/linguistic analysis
      • Co-occurrence/similarity measures, etc.

Simple automatic indexes

Inverted file: contains all the index terms automatically drawn from the document records according to the indexing technique used.
  – Position of term: record number, field number, number of occurrences, position in the field (digits 45-57)

Pros and Cons of Automatic Indexing

Pros
  – Consistency
  – Cost reduction
  – Time reduction
Cons / limitations
  – Human intellect
  – Term relationships
  – Misleading in retrieval
  – Good algorithms, but generally domain-specific

How to gauge effectiveness?

Recall: number of relevant documents retrieved out of all the possible relevant documents in the system. [quantity: did you get it all?]
Precision: percentage of documents retrieved that were relevant. [quality of what you found]

Tradeoff between Recall and Precision

We can easily recall everything that matches a particular text string or pattern; however, we cannot search through all the matching results (there are too many).
We can do an OK job of limiting results to the most relevant, but as we “tune” the results to be more relevant, we leave out more and more matching results.

Future Search

Full-text searching of content, of associated annotations on content, and of metadata (including reader rankings, tags, etc.), as in Connotea, NeoNote, etc.
Facet-based searching (Endeca; e.g. Home Depot, NCSU library).
Cluster-based searching (Clusty).

Study on gene name searching

Looks at full-text searching.
Tradeoff between precision and recall (Hemminger 2007).
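
The recall and precision definitions above, and the tradeoff between them, can be made concrete with a small calculation. The following is a minimal sketch, not taken from the slides: the function name `precision_recall` and the document ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant  # relevant documents that were retrieved
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: four relevant documents exist in the system.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow query returns few documents, all of them relevant.
narrow = precision_recall({"d1", "d2"}, relevant)

# A broad query returns every relevant document plus four irrelevant ones.
broad = precision_recall(
    {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant
)

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```

In this toy case the narrow query is perfectly precise but misses half of the relevant documents, while the broad query finds all of them at the cost of half its results being irrelevant.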
Article Discovery Study

                                         Schizophrenia +
                                         Schizophrenia Gene   Schizophrenia Gene   Arabidopsis Gene
Genes Found in Metadata Only             172     8.58%        3541   20.63%        2712    8.83%
Genes Found in Fulltext Only             1671   83.38%        10125  58.99%        5705   18.57%
Genes Found in Metadata and Full-text    161     8.03%        3498   20.38%        22305  72.60%
Totals for Found Genes                   2004                 17164                30722

Article Review Study

Two literature cohorts
  – Schizophrenia (Pat Sullivan)
  – Arabidopsis (Todd Vision)
Each cohort had three readers.
Readers are asked to “review the article and judge its relevance to them as someone new to the gene in this biological setting, trying to build an understanding of the state of knowledge in that research area.”

Metadata Articles More Valuable

In both cases and for all observers, their mean quality rating values were lower (more useful) for the metadata-discovered articles.
There were statistically significant differences between the mean quality rating for the metadata-discovered articles versus the full-text-discovered articles for both the Arabidopsis and Schizophrenia sets at the p < 0.05 level.

Precision and Recall

                             Schizophrenia                Arabidopsis
                             Recall          Precision    Recall          Precision
Metadata discovered          15.7% (16.6%)   94.7%        84.1% (84.1%)   100%
Full-text only discovered    100%            63.7%        100%            69%

Article Features that correlate with Value: Number of Hits

The number of hits or matches of the search term within the returned document is a commonly used feature to rank returned articles.
To test the value of this feature, the number of hits was correlated with the mean quality ranking for each article (averaged across all observers).
The results clearly show a relationship where articles with many matches of the search term tend to be much more highly valued.
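
The hit-count analysis above can be sketched as two steps: count the matches of the search term in each article, then correlate those counts with the readers' mean ratings. The code below is only an illustration under stated assumptions: the gene name `GENE1`, the article texts, and the ratings are all invented, and Pearson correlation is used as a stand-in because the slides do not say which correlation measure the study used.

```python
import re
from math import sqrt


def count_hits(term, text):
    """Count case-insensitive whole-word matches of term in text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data: (full text, mean rating). As in the study,
# a lower rating means the article was judged more useful.
articles = [
    ("GENE1 and disease: GENE1 variants, GENE1 expression, GENE1 assays.", 1.0),
    ("A survey of candidate genes; GENE1 is one, and gene1 knockouts are discussed.", 2.0),
    ("A methods paper that mentions GENE1 once in a table.", 3.0),
]

hits = [count_hits("GENE1", text) for text, _ in articles]
ratings = [rating for _, rating in articles]
r = pearson(hits, ratings)

print(hits)  # [4, 2, 1]
print(r)     # negative: more hits go with lower (more useful) ratings
```

Because lower ratings mean more useful articles, a strongly negative coefficient in this toy data corresponds to the relationship the slide describes.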
Improving Relevance for Metadata Searching

Repeating the calculations on the schizophrenia and Arabidopsis article review sets, but limited to only matches with high hit counts (Schizophrenia ≥ 20 hits and Arabidopsis ≥ 15 hits), shows that precision for the full text is now the same as (100% in Arabidopsis) or slightly better than (95% versus 94.4% in schizophrenia) that of the metadata-retrieved articles.
However, the number of additional cases discovered by full-text searching is now only slightly better, finding 5% more cases in schizophrenia and 28% more in Arabidopsis.

Conclusions

This suggests that rather than accepting metadata searching as a surrogate for full-text searching, it may be time to make the transition to direct full-text searching as the standard.
This could be accomplished by using certain features of the full-text article, such as the number of hits of the search string or whether the search string is found in the metadata (i.e. our current metadata search), as filters that allow us to increase the precision of our results (and put the user in control of the filtering).
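
The filtering idea in the conclusions can be sketched as a full-text search whose results the user narrows by hit count or by metadata match. This is a hypothetical illustration, not the study's implementation: the `Result` fields, the `filter_results` function, and all toy values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    doc_id: str
    hits: int          # matches of the search string in the full text
    in_metadata: bool  # search string also found in the metadata record
    relevant: bool     # reader judgment (invented toy values below)


def filter_results(results, min_hits=0, require_metadata=False):
    """Keep results that meet the user-chosen hit and metadata filters."""
    return [
        r for r in results
        if r.hits >= min_hits and (r.in_metadata or not require_metadata)
    ]


def precision(results):
    """Fraction of the given results that readers judged relevant."""
    return sum(r.relevant for r in results) / len(results) if results else 0.0


# Invented full-text search results for a single query.
full_text = [
    Result("a", 25, True, True),
    Result("b", 30, False, True),
    Result("c", 2, False, False),
    Result("d", 1, False, False),
    Result("e", 22, True, True),
    Result("f", 3, True, True),
]

high_hits = filter_results(full_text, min_hits=20)
metadata_only = filter_results(full_text, require_metadata=True)

print(precision(full_text))               # 4 of 6 results are relevant
print([r.doc_id for r in high_hits])      # ['a', 'b', 'e']
print([r.doc_id for r in metadata_only])  # ['a', 'e', 'f']
```

In this toy data both filters raise precision to 1.0, but the hit-count filter keeps document `b`, which a metadata-only search would never have found, while the metadata filter keeps the low-hit document `f`. Leaving both thresholds as parameters is one way to put the user in control of the filtering.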
