Metadata as Infrastructure for Information Retrieval and Text Mining
Prof. Ray R. Larson
University of California, Berkeley, School of Information
March 2006 NaCTeM – Ray R. Larson

Overview
Metadata as Infrastructure – What, Where, When and Who?
What are Entry Vocabulary Indexes?
– Notion of an EVI
– How are EVIs built
Time Period Directories
– Mining metadata for new metadata

Metadata as Infrastructure
The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

Metadata as Infrastructure
The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.
The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

What?
Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.
Two kinds of mapping in every search:
• Documents are assigned to topic categories, e.g. Dewey
• Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.
Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

‘What’ searches involve mapping to controlled vocabularies
[Diagram labels: Thesaurus/Ontology, Texts]

Start with a collection of documents.

Classify and index with controlled vocabulary
[Diagram label: Index]
Or use a preindexed collection.

Problem: Controlled index vocabularies can be difficult for people to use.
In Library of Congress subject headings, for “Wirtschaftspolitik” use “Economic Policy”.
Example of an index term: “pass mtr veh spark ign eng”

Solution: Entry Level Vocabulary Indexes.
[Diagram: “Automobile” = EVI = “pass mtr veh spark ign eng”]

“What” and Entry Vocabulary Indexes
EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…

Building and Searching EVIs
[Flowchart, reconstructed as steps]
Searching:
– User has a question but is unfamiliar with the domain he wants to search.
– User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
– Has an Entry Vocabulary Module been built?
  YES: Use an existing EVI.
  NO: Build an Entry Vocabulary Module (EVI), as below.
– Map the user’s query to a ranked list of controlled vocabulary terms.
– User selects search terms from the ranked list of terms returned by the EVI.
Building an Entry Vocabulary Module (EVI):
– Download a set of training data from an Internet DB indexed with a controlled vocabulary.
– Extract terms (words and noun phrases) from titles and abstracts. Part-of-speech tagging is used for noun phrases.
– Build associations between extracted terms & controlled vocabularies.

Technical Details
Building an Entry Vocabulary Module (EVI):
– Download a set of training data from an Internet DB indexed with a controlled vocabulary.
– Extract terms (words and noun phrases) from titles and abstracts. Part-of-speech tagging is used for noun phrases.
– Build associations between extracted terms & controlled vocabularies.

Association Measure
         t     ¬t
  C      a     c
  ¬C     b     d
where t is the occurrence of a term and C is the occurrence of a class in the training set.
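
The counts in this table feed the likelihood-ratio association measure W(C,t) that the deck defines next. The following is a minimal sketch, not the project's actual code: it tallies a, b, c, d from toy training records and ranks classes for a term. The toy records, the function names, the class label "diesel eng", and the filter that keeps only positively associated classes are all assumptions made for illustration. The log-likelihood helper takes its arguments as (p, successes, trials), matching calls such as logL(p1, a, a+b).

```python
import math

def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p); a zero count contributes 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def association(a, b, c, d):
    """Likelihood-ratio weight W(C,t) from the 2x2 counts.

    a: term and class together    c: class without the term
    b: term without the class     d: neither
    """
    p1 = a / (a + b) if a + b else 0.0
    p2 = c / (c + d) if c + d else 0.0
    p = (a + c) / (a + b + c + d)
    return 2.0 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                  - log_l(p, a, a + b) - log_l(p, c, c + d))

def contingency(records, term, cls):
    """Tally a, b, c, d over records of the form (set_of_terms, set_of_classes)."""
    a = b = c = d = 0
    for terms, classes in records:
        has_t, has_c = term in terms, cls in classes
        if has_t and has_c:
            a += 1
        elif has_t:
            b += 1
        elif has_c:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rank_classes(records, term):
    """Rank classes for a term by W(C,t), highest first.

    Assumption: W is two-sided, so classes that co-occur with the term
    less often than without it (p1 <= p2) are dropped before ranking.
    """
    all_classes = set().union(*(classes for _, classes in records))
    scored = []
    for cls in all_classes:
        a, b, c, d = contingency(records, term, cls)
        if a * (c + d) > c * (a + b):  # p1 > p2, cross-multiplied
            scored.append((association(a, b, c, d), cls))
    return sorted(scored, reverse=True)

# Toy training data (hypothetical): terms from titles, assigned index classes.
records = [
    ({"automobile", "engine"}, {"pass mtr veh spark ign eng"}),
    ({"automobile", "car"}, {"pass mtr veh spark ign eng"}),
    ({"engine", "diesel"}, {"diesel eng"}),
    ({"policy", "economy"}, {"economic policy"}),
    ({"economy", "trade"}, {"economic policy"}),
    ({"automobile", "policy"}, {"pass mtr veh spark ign eng", "economic policy"}),
]

if __name__ == "__main__":
    for weight, cls in rank_classes(records, "automobile"):
        print(f"{weight:.3f}  {cls}")
```

A real EVI would tally counts for every extracted term in one pass over the training set rather than rescanning per term and class; the per-pair scan here only keeps the sketch short.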

Association Measure
Maximum likelihood ratio:
$$W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$$
where
$$\log L(p, k, n) = k \log p + (n - k) \log(1 - p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
See Dunning.

Alternatively
Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion.

[Diagram: the query “Find Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, or Tamil is linked to digital library resources by statistical association, W(C,t) = 2[logL(p_1, a, a+b) ...]

EVI example
User query: “Automobile”
EVI 1 gives the index term: “pass mtr veh spark ign eng”
EVI 2 gives the index term: “automobiles” OR “internal combustion engines”

But why stop there?
[Diagram labels: EVI, Index]

“Which EVI do I use?”
[Diagram: several EVI/Index pairs]

EVI to EVIs
[Diagram: an EVI2 placed in front of several EVI/Index pairs]

Why not treat language the same way?
[Diagram: “Find Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil]

It is also difficult to move between different media forms
[Diagram labels: Texts, EVI, Thesaurus/Ontology, Numeric datasets]

Searching across data types
Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results.

But texts associated with numeric data can be mapped as well…
[Diagram labels: Texts, EVI, Thesaurus/Ontology, EVI, captions, Numeric datasets]

EVI to Numeric Data example
[Figure: numbered workflow, steps 1 to 11, linking search interface 1, search interface 2, EVI, LCSH, online catalog, MARC, search results, captions, numeric database, numeric table, and new query]

But there are also geographic dependencies…
[Diagram labels: Texts, EVI, Thesaurus/Ontology, EVI, Maps/Geo Data, captions, Numeric datasets]

WHERE: Place names are problematic…
– Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .
– Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
– Name changes: Bombay → Mumbai.
– Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields.
– Anachronisms: No Germany before 1870
– Vague, e.g. Midwest, Silicon Valley
– Unstable boundaries: 19th century Poland; Balkans; USSR
Use a gazetteer!

WHERE. Geo-temporal search interface.
Place names found in documents. Gazetteer provided lat. & long. Places displayed on map. Timebar.

Zoom on map. Click on place for a list of records. Click on record to display text.

Catalogs and gazetteers should talk to each other!
Catalog search. Gazetteer search. Geographic sort / display of catalog search result.

So geographic search becomes part of the infrastructure
[Diagram labels: Texts, Maps/Geo Data, EVI, Thesaurus/Ontology, Gazetteers, captions, Numeric datasets]

WHEN: Search by time is also weakly supported…
Calendars are the standard for time.
But people use the names of events to refer to time periods.
Named time periods resemble place names in being:
– Unstable: European War, Great War, First World War
– Multiple: Second World War, Great Patriotic War
– Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
Places have temporal aspects & periods have geographical aspects: when the Stone Age was varies by region.

Similarity between place names and period names
Suggests a similar solution: a gazetteer-like Time Period Directory.
Gazetteer:
– Place name
– Type
– Spatial markers (Lat & long)
– When
Time Period Directory:
– Period name
– Type
– Time markers (Calendar)
– Where
Note the symmetry in the connections between Where and When.

Solution - Time Period Directories
Initial development involved mining the Library of Congress Subject Authority file for named time periods…

LC MARC Authorities Records
<USMARC>
  <Fld001>sh 00000613 </Fld001>
  <Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
  <Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
  <Fld670><a>Work cat.: 45053442: Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
  <Fld670><a>Cath. encyc.</a><b>(Magdeburg: besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
  <Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg: ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>

Time Period Directory Instance
timePeriodEntry: contains the components described below
- periodID: Unique identifier
- periodName: Period name, can be repeated for alternative names
    Information about language, script, transliteration scheme
    Source information and notes (where was the period name mentioned)
- descriptiveNotes: Description of time period
- dates: Calendar and date format
    Begin & end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing)
    Notes, sources
- periodClassification: Period type, e.g. Period of Conflict, Art movement
    Can plug in different classification schemes
    Can be repeated for several classifications
- location: Associated places with time period
    Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates
    Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names)
    Recently added coordinates for direct use
- relatedPeriod: Related time periods
    periodID of related periods
    Information about relationship type (part-of, successor etc.)
    Can plug in different relationship type schemes
- entryMetadata: Notes about creator / creation of instance
    Entry date

Time periods by named location

Catalog Search Result

Web Interface - Access by map

Zoomable interface gives access to geographically focused info…

Web Interface - Access by timeline
Link initiates search of the Library of Congress catalog for all records relating to this time period.

WHEN and WHAT
These named time periods are derived from Library of Congress catalog subject headings and so can be used for catalog searching, which finds books on topics important for that time period.

Time period directories link via the place (or time)
[Diagram labels: Texts, Maps/Geo Data, EVI, Thesaurus/Ontology, Gazetteers, captions, Time Period Directory, Numeric datasets, Time lines, Chronologies]

WHEN, WHERE and WHO
Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

Place and time are broadly important across numerous tools and genres including, e.g. Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc., etc.
Biographical dictionaries are heavy on place and time:
Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.

A new form of biographical dictionary would link to all
[Diagram labels: Biographical Dictionary, Texts, Maps/Geo Data, EVI, Thesaurus/Ontology, Gazetteers, captions, Time Period Directory, Numeric datasets, Time lines, Chronologies]

A Metadata Infrastructure
[Diagram: CATALOGS, INTERMEDIA INFRASTRUCTURE, RESOURCES]
CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
INTERMEDIA INFRASTRUCTURE:
  Facet    Authority Control          Special Display Tools
  WHAT     Thesaurus                  Syndetic Structure
  WHERE    Gazetteer                  Maps
  WHEN     Time Period Directory      Timelines
  WHO      Biographical Dictionary    Dossiers
RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
Other labels in the diagram: Learners; Text and Images

Acknowledgements
Electronic Cultural Atlas Initiative project.
This work was partially supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries, award number LG-02-04-0041-04, Oct 2004 - Sept 2006, entitled “Supporting the Learner: What, Where, When and Who”.
See: http://ecai.org/imls2004
Michael Buckland, Fred Gey, Vivien Petras, Matt Meiske, Kim Carl
Contact: [email protected]
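
Returning to the Time Period Directory slides: the step of mining an authority record for a named time period can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's implementation. The `TimePeriodEntry` class, the `mine_period` function, the reuse of the Fld001 control number as `period_id`, and the use of Fld550 headings as classifications are all hypothetical choices. The record is a shortened copy of the Magdeburg example, with the date range written as 1550-1551.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Shortened copy of the Magdeburg authority record from the slides.
RECORD = """<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
</USMARC>"""

@dataclass
class TimePeriodEntry:
    """A reduced, hypothetical version of the timePeriodEntry components."""
    period_id: str
    period_name: str
    begin: int
    end: int
    locations: list = field(default_factory=list)
    classifications: list = field(default_factory=list)

# Matches chronological subdivisions such as "Siege, 1550-1551" or "Siege, 1550-51".
CHRONO = re.compile(r"\s*(.+?),\s*(\d{4})(?:-(\d{2,4}))?\s*")

def mine_period(xml_text):
    """Return a TimePeriodEntry for a geographic heading with a dated subdivision, else None."""
    root = ET.fromstring(xml_text)
    heading = root.find("Fld151")
    if heading is None:
        return None
    match = CHRONO.fullmatch(heading.findtext("y", default=""))
    if match is None:
        return None
    name, begin, end = match.group(1), int(match.group(2)), match.group(3)
    if end is None:
        end_year = begin
    elif len(end) < 4:
        # Abbreviated end year: borrow the leading digits from the begin year.
        end_year = int(str(begin)[: 4 - len(end)] + end)
    else:
        end_year = int(end)
    place = heading.findtext("a", default="").strip()
    broader = [f.findtext("a") for f in root.findall("Fld550") if f.findtext("a")]
    return TimePeriodEntry(
        period_id=root.findtext("Fld001", default="").strip(),
        period_name=name,
        begin=begin,
        end=end_year,
        locations=[place] if place else [],
        classifications=broader,
    )

if __name__ == "__main__":
    print(mine_period(RECORD))
```

The place name kept in `locations` is what would be looked up in a gazetteer to obtain coordinates, which is the Where link that the Gazetteer and Time Period Directory comparison describes.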