Digital Libraries and the Semantic Web
A conceptual framework and an agenda for research and practice
Keynote presentation at ICSD 2009

Dagobert Soergel
Department of Library and Information Studies
Graduate School of Education
University at Buffalo

Acknowledgments
Many of the ideas in this presentation originated from a review of the papers submitted to the International Conference on the Semantic Web and Digital Libraries 2009 (ICSD 2009). Acknowledgments are therefore due to all the paper authors.

Soergel, ICSD 2009 Keynote 2

DLs versus SW
Digital Libraries
Manage collections, often large, of documents and data sets and provide access to these resources and, ideally, tools to process them. Retrieval is often based on words in text.
Semantic Web
Uses inference over a large distributed storehouse of propositional data, including ontologies, to
- answer a question,
- derive a problem solution,
- devise a plan of action.

DL ↔ SW
DL → SW
How can digital libraries support Semantic Web functionality?
Generate propositional knowledge, including ontologies, from document corpora through information extraction or statistical methods
SW → DL
How can Semantic Web technology improve digital libraries?
Use semantics to improve retrieval and presentation
Towards unified systems
Harmonize standards from DLs (and libraries generally) and SW, profiting from the thinking of both communities

Overview
• Information extraction (and its use for ontology creation)
• Semantically enriched documents
  Integrated store of documents, propositions, data sets
• Navigation in concept structures and document spaces
• Support for learning, sense making, tasks
• Schema and ontology creation and mapping

Information extraction
Text
High blood pressure is a serious disease often caused by being overweight.
In kids 4 – 12 it can be treated highly effectively with Nystatin.
Formal representation
Causation (HighBloodPressure, Obesity)
Treatment (HighBloodPressure, {Human, [Age, 4-12y]}, Nystatin, [Effectiveness, 4])

Answering questions
Question
How can high blood pressure be prevented?
Answer
Lose weight?

Information extraction
Text
Kids begin grazing independently from their mothers at three months
Formal representation
Separation (Mother, Child, {Goat, [Age, 3m]})

Automatic information extraction
• Find suitable documents or images
  Highly structured documents (such as dictionaries) and documents containing structured lists (such as a classification of life events) work well
• Recognize entities (concepts, named entities)
  Find the unique identifier for each (from some standard scheme)
• Noun phrase and verb phrase identification
• Word sense disambiguation, co-reference resolution
• Determine relationships, express propositions in formal representation
• Much of this requires syntactic and semantic parsing
  Also recognition of relationships from typographical arrangement
• Recognition of propositions not expressed in a single sentence
• Deal with negation and other qualifications.
  Certainty (as expressed in one source)

Automatic information extraction
• Add to proposition store
  • If proposition already known, just add reference to source
  • If proposition new, add proposition with its source
• Identify relationships between propositions (such as contradictions)
• Certainty (from information across sources, considering evidential strength of each source)
• Can label proposition as to general origin (language of source document, cultural origin of source document, scholarly / scientific school of source document)
• Knowledge in proposition store assists in IE from new documents

Computer-supported IE
• Automatic information extraction is hard, need to supplement with human IE
• IE as part of document authoring or during publishing
  Collaborative IE (crowdsourcing)
• Build systems that support the human task
  Make human IE and semantic enrichment by authors feasible
  • Person edits results of automatic IE
  • Person enters free-form proposition, system converts to formal representation, person checks
  • Reconciliation of differences in results
• Computer-supported IE system should learn from changes made by human editor

Corpus-based information extraction
• Find associations in a corpus
• Data mining over text corpora or numeric databases
• Finding connections between non-overlapping literatures, pioneered by Don Swanson

Multilingual information extraction
• Requires IE tools in multiple languages
• Creates proposition store from many sources
• Interesting experiment
  Document exists in two languages
  Apply IE to both versions and compare results

IE for Ontology creation
• Some extracted propositions can be used as elements of an ontology
  Discussed later

Semantic enrichment

A semantically enriched document
Reis et al.
(2008) Impact of Environment and Social Gradient on Leptospira infection in Urban Slums (doi:10.1371/journal.pntd.0000228).
Infectious disease studied: Leptospirosis
Pathogen (causative agent of disease): Leptospira spirochete
Vector of disease pathogen: Rat (Rattus norvegicus)
Pathogen host subjected to study: Human (Homo sapiens)
Number of subject individuals in study: 3,171
...
Purpose of study: Quantify risk factors for leptospirosis . . .
Principal finding 1: Prevalence of Leptospira antibodies . . .
Principal finding 2: Disease risk . . .open sewers . . . (http://dx.doi.org/10.1371/journal.pntd.0000228.x002)

A semantically enriched document
Tag Trees of Individual Semantic Classes of Highlighted Terms
ID = Infectious Disease Ontology, GO = Gene Ontology

disease
infectious diseases
diarrheal disease
childhood diarrhea
dengue
leptospirosis
human leptospirosis
meningococcal disease
pulmonary hemorrhage syndrome
visceral leishmaniasis
Weil's disease
occupational disease
zoonotic disease

term used in ID
ID:0000012 immunity
ID:0000017 mortality
ID:0000023 zoonotic
ID:0000025 pathogenicity
ID:0000034 endemic
ID:0000038 parasite
ID:0000056 host
ID:0000057 carrier
ID:0000063 vector
ID:0000064 pathogen
ID:0000066 infectious agent
ID:0000069 primary pathogen
ID:0000104 infection

ID = Infectious Disease Ontology, GO = Gene Ontology
IDO:0000000 ! process
IDO:0000083 transmission
IDO:0000231 horizontal transmission (GO:0000031)
IDO:0000104 infection
IDO:0000084 pathogenesis
IDO:0000221 ! infectious disease progression
IDO:0000100 ! pathogen evasion of host immune response
IDO:0000111 antigenic variation
IDO:0000115 genetic diversification
IDO:0000226 pathogen life cycle (GO:0000026)
IDO:0000001 ! role
IDO:0000036 ! colonizer
IDO:0000038 parasite
IDO:0000048 symptom
IDO:0000056 host
IDO:0000057 carrier
IDO:0000059 reservoir
IDO:0000063 vector
IDO:0000064 pathogen
IDO:0000066 infectious agent
IDO:0000069 primary pathogen
IDO:0000200 mode of transmission (GO:0000000)
IDO:0000002 ! quality
IDO:0000215 ! quality of host population

Semantically enriched documents
• Semantic enrichment supports semantic retrieval
• Broad area of its own
• Many different forms
  • Explicit document structure
  • Concept and named entity tagging and identification
  • Assigning additional concepts or named entities
  • Assigning extracted propositions
• Closely linked with information extraction
• IE produces elements of semantic enrichment

Semantic enrichment through document structure
• On a broad level, a document's semantics can be made explicit simply by the internal document structure
• Requires a document template or frame for the type of document
• Document Structure Ontology with templates / frames for many types of documents, including learning objects.
  Standards for digital objects
• Includes document formats such as MPEG or SCORM

Template for a research report
1 Background (could also be called Problem)
1.1 General problem area (often including a review of the literature)
1.2 Specific problem. Purpose of the study, question to be answered
2 Methods
2.1 Discussion of the methods used in the study
2.2 Description of the actual conduct of the study
3 Results
4 Conclusions
4.1 Summary of methods and results
4.2 Relationship to existing body of knowledge.
4.3 Implications for decision making and/or further research

Concept and named entity tagging and identification
• Includes abstract concepts and named entities such as persons, organizations, places, dates, events, etc.
• Identified with reference to some standard scheme, such as a Knowledge Organization System (KOS, includes ontologies, thesauri, etc.) or NE registry.
  Add identifier as part of the tag
• Can tag within text or list separately as metadata (with pointer to the precise piece of the text)

Additional concepts or named entities
• Concepts or named entities that are not designated by a word or phrase in the text but implied by the document as a whole or a passage in it
• Assigned through
  • Statistical automatic classifier
  • Rule-based inference
  • Human editor (with ontology-based assistance)
• Each concept or NE should be linked to smallest text passage that implies it (may be the whole document)

Assigning extracted propositions
• Allows for more precise retrieval
• Example: Precise retrieval of documents on causation is notoriously difficult
  Does A cause B? What are the effects of A? What causes B?
• If propositions of the form A causes B are assigned to the document in semantic enrichment, such searches are possible
• Propositions can be transferred to a larger repository (see IE) or be available only through the enriched Web document; they can still be found and used by Semantic Web agents

Making semantic enrichment available
• Documents are enriched from many sources
  The same document may receive multiple enrichments
• Digital libraries and publishers should ensure that a user looking at any copy of a document sees all the semantic enrichments for this document.

Dual representation of document content
• Representation to use same content for two purposes
  • for people (teach people)
  • for computer processing (teach computer systems)
• How precise is the correspondence? How complete is each representation?
• How easy is it to get from one to the other?
  • Information extraction
  • Text and image generation
    Text generation in multiple languages: one approach to translation

Integrated Digital Libraries: Documents + Data + Tools
Elements
• Semantically enriched documents
• Proposition store (including propositions in any Web document)
• Data sets
• Tools for data analysis and reasoning
• All linked together, for example
  • Drill down from a formally stated proposition to text and to supporting data
  • Link from text to formal propositions and related texts
  • Link from data set to suitable data analysis tools
Created and maintained collaboratively
Example: Neurocommons http://sciencecommons.org/projects/data

Navigation in concept structures and document/data spaces

Concept structures
• Internally, concept structures are often represented in RDF or OWL
• Externally, for the user, they need to be shown in a meaningful representation that reflects concept relationships so the user can understand them and navigate them
• Can be trees shown in outline form with cross-references or concept maps
• Challenge of producing these automatically

Concept structure with data

Document/data spaces
• Documents and document passages are related in many ways that can be used for navigation and presentation
• Challenge 1. Identify passages in multiple documents and arrange them according to relationships that allow the user to see the whole picture and navigate passages in a meaningful sequence
• Challenge 2. Arrange passages to fit into the structure of an argument

Multi-level topical structure

Information arranged by role in argument

Topical relevance typology
Function-based
. Rhetorical structure
. . Matching topic
. . Evidence (Indirect)
. . Context
. . Comparison
. . Evaluation
. . Method / Solution
. . Purpose / Goal
. Argument structure
. . Grounds
. . Warrants
. . Claim
Reasoning-based
. Generic inference
. Comparison-based
. Induction / rule-based
. Causal-based
. Transitivity-based
Semantic-based (Green & Bean, 1995)
. Taxonomy
. Partonomy
. Frame-based, etc.

RST+ Functional Role
Matching topic (Direct)
. Manifestation
. . Image content
. . Image theme
Evidence (Indirect)
Context
. Scope
. Framework
. Environmental setting
. Social background
. Time & sequence
. Assumption / expectation
. Biographic information
Cause / Effect
. Cause
. Effect / Outcome
. Explanation (causal)
. Prediction
Comparison
. By similarity (analogy) / By difference (contrast)
. By factor that is different
Method / Solution
. Method / Approach
. . Instrument
. . Technique / Style
Condition
. Helping or hindering factor
. Unconditional
. Exceptional condition
Purpose / Motivation
Evaluation
. Significance
. Limitation
. Criterion / Standard
. Comparative evaluation

Functional role: Comparison
Comparison
. By similarity vs. By difference (Contrast)
. . By similarity
. . . Analogy & metaphor
. . By difference (Contrast)
. By factor that is different
. . Different external factor
. . . Different time
. . . Different place
. . Different participant
. . . Different actor
. . . Different subject acted upon
. . Different act or experience
. . . Different act
. . . Different experience

Support for learning, sensemaking, tasks

Support for learning
• Structuring learning objects from small reusable elements
• Indexing learning objects so they can be
  • matched with individual learners
  • arranged in a meaningful didactic sequence
• Automatic composition of learning objects customized for individual learner
• Support learner control where appropriate

Support for learning
• Requires specialized document structure ontology
• Requires ontologies for
  • Learner characteristics
  • Learning objectives
  • Learning object characteristics that can be used for matching
  • Types of relationships between learning objects
    Examples: Prerequisite, elaboration

Support for learning
• Requires domain ontologies adapted for learning and instruction
  • Show meaningful structures for assimilation by the learner
  • Support arrangement and sequencing of material to be learned
• Tools for ontology construction by the learner, for example, concept maps
  Active learning, building own structures, constructivist approach

Sense-making
Sense-making is the process of creating an understanding of a problem or task so that further actions may be taken in an informed manner
• Sense-making is a prerequisite for many other tasks such as decision making and problem solving
• Sense-making involves making clear the interrelated concepts and their relationships in a problem or task space.
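
As a rough illustration of "making clear the interrelated concepts and their relationships", the sketch below builds a tiny concept map from propositions collected from several sources and prints it in outline form. The concepts come from the leptospirosis example earlier in the talk; the relation names (hasPathogen, hasVector, hasHost) and the source labels doc-1 and doc-2 are assumptions made up for this sketch and do not come from any standard scheme.

```python
from collections import defaultdict

# Each proposition is (subject, relation, object, source).
# Relation names and source labels are hypothetical, for illustration only.
PROPOSITIONS = [
    ("Leptospirosis", "hasPathogen", "Leptospira", "doc-1"),
    ("Leptospirosis", "hasVector", "Rat", "doc-1"),
    ("Leptospirosis", "hasHost", "Human", "doc-1"),
    ("Leptospirosis", "hasVector", "Rat", "doc-2"),  # same claim, second source
]


def build_concept_map(propositions):
    """Group propositions into {subject: {(relation, object): [sources]}}."""
    concept_map = defaultdict(dict)
    for subj, rel, obj, source in propositions:
        # A proposition that is already known only gains a source reference.
        sources = concept_map[subj].setdefault((rel, obj), [])
        if source not in sources:
            sources.append(source)
    return concept_map


def outline(concept_map):
    """Render the concept map as indented outline lines."""
    lines = []
    for subj in sorted(concept_map):
        lines.append(subj)
        for (rel, obj), sources in sorted(concept_map[subj].items()):
            lines.append(f"  {rel} -> {obj} [{', '.join(sources)}]")
    return lines


concept_map = build_concept_map(PROPOSITIONS)
for line in outline(concept_map):
    print(line)
```

Keeping the list of sources on each link lets a reader drill down from a relationship in the map to the documents that support it.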
Sense-making scenario 1
Intelligence task T1: al-Bashir
The US wants to take action towards a resolution of the Darfur conflict. Al-Bashir, the Sudanese president, is one of the key players in the area who is believed to have significant responsibility for continuous conflicts in the region. The administration needs to know as much as possible about al-Bashir in order to better negotiate with the involved parties and strategize its efforts. Your task is to produce a report that identifies information to assess the influence of al-Bashir and makes recommendations for policy decisions and diplomatic actions.
Requested information includes:
• key figures, organizations, and countries who have been associated with al-Bashir;
• his rise to power; and
• groups who have resisted him and the level of success in their resistance.
One could draw a concept map drawing on multiple sources (the map is for illustration)

Support for sensemaking
• Sensemaker needs structures
• Inputs to structure-building from documents that give explicit structures, find such documents
• Sensemaker looks for data (often information extraction from text) and fits data into structure.
  If data do not fit, revises structure
• Some of this process could be automated as discussed earlier

Support for tasks
• Have system derive solution
• Support user in deriving solution
  • Support for sense-making
  • Arrange search results by how they relate to the task
• Needs ontologies related to task, for example
  • Ontology of types of tasks/problems
  • Ontology of tasks/problems and their subtasks/problems
  • Knowledge base of tasks/problems and solutions (with drilling down to documents)

Schema and ontology creation and mapping

KOS/ontologies for SW and DL
• Semantic Web is bringing ontology and classification back to retrieval
• Ontologies created for the Semantic Web are often more exact than library classifications and thesauri, which is added value for digital libraries
• A related issue is creating universal identifiers for many types of named entities (e.g. OKKAM)
  Library cataloging rules incorporate much thinking about the form of personal and corporate names
• Often inextricably linked with storing propositions about these entities (needed for identification)

Automatic input to ontology generation
• From text
  • Extract ontological relations such as isa and partOf
    Can be done one document, even passage, at a time
  • Statistical association, machine learning, data mining
    Requires a corpus
• From search logs
  • Identifying patterns in series of successive queries
  • Statistical association, machine learning, data mining
    Requires a corpus
• Digital libraries supply corpora of texts and queries

Reuse knowledge in existing KOS
• Existing Knowledge Organization Systems (KOS), such as ontologies, library classifications, thesauri, dictionaries contain much intellectual capital that can be reused.
  Need to find these sources
  Need tools to exploit this knowledge

Componential analysis for deriving KOS structure
• Also known as facet analysis
• Expressing concepts as description logic formulas using primitive concepts is a local operation. The results can be used for deriving global structures
• Semantic components can often be discovered by linguistic analysis of concept labels and considering the structure of a scheme like Dewey Decimal Classification

Human ontology editing
• High-quality ontologies and other KOS need human editing
• The Semantic Web community provides ontology editors, but the standards used do not accommodate all information needed for the functions described above
• Need more comprehensive tools
• Must support distributed collaborative editing

Granularity of ontologies
• For many retrieval tasks, shades of meaning can be ignored
• Sometimes capturing shades of meaning is important, particularly in a multilingual environment
• Text generation needs knowledge about subtleties of meaning and usage

KOS/ontology mapping
• Very important for both Digital Libraries and Semantic Web
• Includes translation between natural languages
• The discussion on methods and tools for ontology creation applies
• Aligned corpora are useful
• Componential analysis can be used for KOS mapping

Mapping through a Hub
Dewey: 387 Water, air, space transportation | Hub: Water transport | LCSH: Shipping
Dewey: 386 Inland waterway & ferry transportation | Hub: Inland water transport | LCSH: Inland water transport
Dewey: 387.5 Ocean transportation | Hub: Ocean transport | LCSH: Merchant marine
Hub: Traffic station ⊓ Water transport
Dewey: 386.8 Inland waterway tr. > Ports | Hub: Traffic station ⊓ Inland water tr.
Dewey: 387.1 Ports | Hub: Traffic station ⊓ Ocean transport
LCSH: Harbors
German: Hafen

Special case: Schema mapping
• Database schemas
• Document structures
  Example: Learning objects structured according to different structures
• Map schema, then convert content from one schema to another
  What information is lost?

Take-home message
Digital Libraries and the Semantic Web are mutually dependent and supportive and convergent

DL ↔ SW
Dagobert Soergel
College of Information Studies
University of Maryland
Department of Library and Information Studies
Graduate School of Education
University at Buffalo
dsoergel@umd.edu
www.dsoergel.com
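
The schema-mapping step above (map the schema, then convert content, then ask what information is lost) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the two learning-object schemas, their field names, and the sample record are all hypothetical and are not taken from SCORM or any other standard.

```python
def convert_record(record, field_map):
    """Convert a record to the target schema.

    field_map maps source field names to target field names.
    Returns (converted_record, lost_fields), where lost_fields lists the
    source fields that have no counterpart in the target schema.
    """
    converted = {}
    lost = []
    for field, value in record.items():
        target = field_map.get(field)
        if target is None:
            lost.append(field)  # no mapping: this information is lost
        else:
            converted[target] = value
    return converted, sorted(lost)


# Hypothetical mapping between two learning-object schemas.
FIELD_MAP = {"title": "name", "objective": "learning_goal"}

# Hypothetical learning object described in the source schema.
record = {
    "title": "Introduction to thesauri",
    "objective": "Explain broader and narrower terms",
    "difficulty": "beginner",
}

converted, lost = convert_record(record, FIELD_MAP)
print(converted)
print("Lost in conversion:", lost)
```

Reporting the unmapped fields explicitly makes the question "What information is lost?" something a conversion tool can answer for every record.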