
Grid-based Search and Data Mining Using Cheshire3

Presented by Ray R. Larson
University of California, Berkeley, School of Information
In collaboration with Robert Sanderson
University of Liverpool, Department of Computer Science
ISGC 2007 - Taipei, Taiwan 2007.03.29

Overview
• Introduction
• Context
• Architecture
• Grid
• Text Mining
• Data Mining
• Applications
• Future Plans and Applications
• Questions?

Introduction
• Cheshire History:
  – Developed at UC Berkeley originally
  – Solution for library data (C1), then SGML (C2), then XML
  – Monolithic applications for indexing and retrieval server in C + TCL scripting
• Cheshire3:
  – Developed at Liverpool, plus Berkeley
  – XML, Unicode, Grid scalable: Standards based
  – Object Oriented Framework
  – Easy to develop and extend in Python

Introduction
• Today:
  – Version 0.9.4
  – Mostly stable, but needs thorough QA and docs
  – Grid, NLP and Classification algorithms integrated
• Near Future:
  – June: Version 1.0
    • Further DM/TM integration, docs, unit tests, stability
  – December: Version 1.1
    • Grid out-of-the-box, configuration GUI

Context
• Environmental Requirements:
  – Very Large scale information systems
    • Terabyte scale (Data Grid)
    • Computationally expensive processes (Comp. Grid)
    • Digital Preservation
    • Analysis of data, not just retrieval (Data/Text Mining)
    • Ease of Extensibility, Customizability (Python)
    • Open Source
    • Integrate not Re-implement
    • "Web 2.0" – interactivity and dynamic interfaces

Context
[Diagram: layered architecture. Labels recovered from the figure:
 Application Layer: User Interface (Web Browser, Multivalent, Dedicated Client).
 Digital Library Layer: Information System (Cheshire3); Protocol Handler (Apache + Mod_Python + Cheshire3); Data Mining Tools (Orange, Weka, ...); Text Mining Tools (Tsujii Labs, ...); Natural Language Processing; Information Extraction; Classification; Clustering; User Interface (MySRB, PAWN).
 Data Grid Layer: Data Grid Store (SRB, iRODS); Process Management (Kepler, Cheshire3, iRODS rules); Term Management (Termine, WordNet, ...); Document Parsers (Multivalent, ...).
 Arrow labels: Query, Results, Search / Retrieve, Index / Store, Export, Parse.]

Cheshire3 Object Model
[Diagram: Cheshire3 object model. Objects shown: Server, Protocol Handler, ConfigStore, User, UserStore, Database, Query, ResultSet, Document, Document Group, DocumentStore, PreParser (chained), Parser, Record, RecordStore, Transformer, Extracter, Normaliser, Index, IndexStore, Terms. Flow labels: Ingest Process, Documents, Records, Query.]

Object Configuration
• One XML 'record' per non-data object
• Very simple base schema, with extensions as needed
• Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases)
• Allows workflows to reference by identifier but act appropriately within different contexts.
• Allows multiple administrators to define objects without reference to each other

Grid
• Focus on ingest, not discovery (yet)
• Instantiate architecture on every node
• Assign one node as master, rest as slaves. Master then divides the processing as appropriate.
• Calls between slaves possible
• Calls as small, simple as possible: (objectIdentifier, functionName, *arguments)
• Typically: ('workflow-id', 'process', 'document-id')

Grid Architecture
[Diagram: the Master Task sends (workflow, process, document) calls to Slave Task 1 ... Slave Task N; each slave fetches its document from the Data Grid and writes extracted data to GPFS Temporary Storage.]

Grid Architecture - Phase 2
[Diagram: the Master Task sends (index, load) calls to Slave Task 1 ... Slave Task N; each slave fetches extracted data from GPFS Temporary Storage and stores its index in the Data Grid.]

Workflow Objects
• Written as XML within the configuration record.
• Rewrites and compiles to Python code on object instantiation
Current instructions:
  – object
  – assign
  – fork
  – for-each
  – break/continue
  – try/except/raise
  – return
  – log (= send text to default logger object)
Yes, no if!

Workflow example
  <subConfig id="buildSingleWorkflow">
    <objectType>workflow.SimpleWorkflow</objectType>
    <workflow>
      <object type="workflow" ref="PreParserWorkflow"/>
      <try>
        <object type="parser" ref="NsSaxParser"/>
      </try>
      <except>
        <log>Unparsable Record</log>
        <raise/>
      </except>
      <object type="recordStore" function="create_record"/>
      <object type="database" function="add_record"/>
      <object type="database" function="index_record"/>
      <log>"Loaded Record:" + input.id</log>
    </workflow>
  </subConfig>

Text Mining
• Integration of Natural Language Processing tools
• Including:
  – Part of Speech taggers (noun, verb, adjective, ...)
  – Phrase Extraction
  – Deep Parsing (subject, verb, object, preposition, ...)
  – Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi)
• Planned: Information Extraction tools

Data Mining
• Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes
• Focus on automatic classification for predefined categories rather than clustering
• Algorithms integrated/implemented:
  – Perceptron, Neural Network (pure Python)
  – Naïve Bayes (pure Python)
  – SVM (libsvm integrated with Python wrapper)
  – Classification Association Rule Mining (Java)

Data Mining
• Modelled as multi-stage PreParser object (training phase, prediction phase)
• Plus need for AccumulatingDocumentFactory to merge document vectors together into single output for training some algorithms (e.g., SVM)
• Prediction phase attaches metadata (predicted class) to document object, which can be stored in DocumentStore
• Document vectors generated per index per document, so integrated NLP document normalization for free

Data Mining + Text Mining
• Testing integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies.
• Computational grid for distributing expensive NLP analysis
• Results show better accuracy with fewer attributes:

  Vector Source                      Avg Attributes   TCV Accuracy
  Every word in document             99               85.7%
  Stemmed words in document          95               86.2%
  Part of Speech filtered words      69               85.2%
  Stemmed Part of Speech filtered    65               86.3%
  Genia filtered                     68               85.5%
  Genia Stem filtered                64               87.2%

Applications (1)
Automated Collection Strength Analysis
Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries.
The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records. This involved very large scale processing of records to:
  – Deduplicate millions of records
  – Enrich deduplicated records against database of 45 million
  – Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

Applications (1)
• Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems.
• The graph compares the number of books classified in the domain of Psychology originally and after enhancement using data mining.
[Chart: Records per Library for All of Psychology. Series: Original, Enhanced. Libraries: Goldsmiths, Kings, Queen Mary, Senate, UCL, Westminster. Vertical axis: 0 to 5500 records.]

Applications (2)
Assessing the Grade Level of NSDL Education Material
• The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid.
• Working with SDSC we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL.
• We determined the vocabulary-based grade level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier).
• This processing was done on the TeraGrid cluster at SDSC.
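
The two Flesch measures used for this assessment can be sketched in a few lines of Python. This is only an illustration using the standard published coefficients; the function names are hypothetical, and the sketch assumes the word, sentence, and syllable counts have already been computed, since the slides do not say how the material was tokenised or how syllables were counted.

```python
def _check_counts(total_words, total_sentences, total_syllables):
    """Reject counts that would make the ratios undefined or meaningless."""
    if total_words <= 0 or total_sentences <= 0 or total_syllables < 0:
        raise ValueError("word and sentence counts must be positive")


def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease Score: higher values mean easier text."""
    _check_counts(total_words, total_sentences, total_syllables)
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    _check_counts(total_words, total_sentences, total_syllables)
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    # Hypothetical page: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Taking counts rather than raw text keeps the sketch independent of any particular tokeniser or syllable counter, which is where real implementations differ most.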

Applications (2)
• The formula for the Flesch Reading Ease Score:
  FRES = 206.835 - 1.015 * ((total words)/(total sentences)) - 84.6 * ((total syllables)/(total words))
• The Flesch-Kincaid Grade Level Formula:
  FKGLF = 0.39 * ((total words)/(total sentences)) + 11.8 * ((total syllables)/(total words)) - 15.59
• The Domain was determined by:
  – Domains used were based upon the AAAS Benchmarks
  – Samples from each of the domain areas being examined are taken in to produce scored and ranked lists of vocabularies for each domain.
  – Each token in a document is passed through a lookup function against this table and tallies are calculated for the entire document.
  – These tallies are then used to rank the order of likelihood of the document being about each topic, and a statistical pass of the results returns only those topics that are above a certain threshold.

Future Plans
• IR Testing and Optimization
  – Work with the OCA Book collection as part of INEX 2007
  – TREC, CLEF, and INEX Benchmarking
• Integration of Geographic Information Retrieval methods from Cheshire II
  – GIR Ranking and Gazetteer-based text retrieval using NLP methods
• Pattern-driven text mining methods for extracting biographical information from texts
  – IMLS-funded “Bringing Lives to Light” project

Overview
• Bringing Lives to Light
  – Focusing on the Who in Who, What, Where and When
  – Examining and extending various types of Biographical Markup
  – Mining biographical data from available information resources to fill our extended markup databases

WHEN, WHERE and WHO
• Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

Place and time are broadly important across numerous tools and genres including, e.g., language atlases, library catalogs, biographical dictionaries, bibliographies, archival finding aids, museum records, etc.
Biographical dictionaries are also heavy on place and time:
  Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.

A new form of biographical dictionary would link to all:
[Diagram labels: Biographical Dictionary, Texts, Maps/Geo Data, EVI, Thesaurus/Ontology, Gazetteers, captions, Time Period Directory, Numeric datasets, Time lines, Chronologies.]

“Lives” Projected Work
• Develop XML markup for Biographical Events
• Most likely to be adaptation and extension of existing biographical event markup
  – Example: EAC/EAD
• Harvest biographical resources
  – Wikipedia, etc.
• Integrate as next generation of current interface

EAC/EAD
  <bioghist>
    <head>Biographical Note</head>
    <chronlist>
      <chronitem>
        <date>1892, May 7</date>
        <event>Born, <geogname>Glencoe, Ill.</geogname></event>
      </chronitem>
      <chronitem>
        <date>1915</date>
        <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
      </chronitem>
      <chronitem>
        <date>1916</date>
        <event>Married <persname>Ada Hitchcock</persname></event>
      </chronitem>
      <chronitem>
        <date>1917-1919</date>
        <event>Served in <corpname>United States Army</corpname></event>
      </chronitem>
    </chronlist>
  </bioghist>

Wikipedia data
Life events metadata
  WHAT: Actions: prisoner
  WHERE: Places: Holstein
  WHEN: Times: 1261-1262
  WHO: People: Margaret Sambiria
Need external links

A Metadata Infrastructure
[Diagram: intermedia infrastructure linking catalogs and resources through four facets. Labels recovered from the figure:
 Facet / Authority Control / Special Display Tools: WHAT / Thesaurus / Syndetic Structure; WHERE / Gazetteer / Maps; WHEN / Time Period Directory / Timelines; WHO / Biographical Dictionary.
 CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers.
 RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages.
 Other labels: Learners, Dossiers.]

“Lives” Acknowledgements
• Electronic Cultural Atlas Initiative project
• This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries
• Contact: [email protected]

Thank you!
Available via http://www.cheshire3.org
