Digital Libraries
and the Semantic Web
A conceptual framework
and an agenda for research and
practice
Keynote presentation at ICSD 2009
Dagobert Soergel
Department of Library and Information Studies
Graduate School of Education
University at Buffalo
Acknowledgments
Many of the ideas in this presentation originated from a
review of the papers submitted to the
International Conference on the Semantic Web and
Digital Libraries 2009 (ICSD 2009).
So acknowledgments are due to all the paper authors.
Soergel, ICSD 2009 Keynote
DLs versus SW
Digital Libraries
Manage, often large, collections of documents and data
sets and provide access to these resources and ideally
tools to process them.
Retrieval often based on words in text.
Semantic Web
Uses inference over a large distributed storehouse of
propositional data, including ontologies, to
- answer a question,
- derive a problem solution,
- devise a plan of action.
DL ↔ SW
DL → SW
How can digital libraries support Semantic Web functionality?
Generate propositional knowledge, including ontologies, from
document corpora through information extraction or statistical
methods
SW → DL
How can Semantic Web technology improve digital libraries?
Use semantics to improve retrieval and presentation
Towards unified systems
Harmonize standards from DLs (and libraries generally) and
SW, profiting from the thinking of both communities
Overview
• Information extraction (and its use for ontology creation)
• Semantically enriched documents
Integrated store of documents, propositions, data sets
• Navigation in concept structures and document spaces
• Support for learning, sense making, tasks
• Schema and ontology creation and mapping
Information extraction
Text
High blood pressure is a serious disease often caused
by being overweight. In kids 4 – 12 it can be treated
highly effectively with Nystatin
Formal representation
Causation (HighBloodPressure, Obesity)
Treatment (HighBloodPressure, {Human, [Age, 4-12y]},
Nystatin, [Effectiveness, 4])
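A minimal sketch of how the formal representation above might be held in a program. The class name Proposition, its fields, and the flattening of qualifiers into (name, value) pairs are illustrative assumptions, not a notation from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    relation: str            # e.g. "Causation", "Treatment"
    args: tuple              # argument order as written on the slide
    qualifiers: tuple = ()   # (name, value) pairs; hypothetical layout


causation = Proposition("Causation", ("HighBloodPressure", "Obesity"))
treatment = Proposition(
    "Treatment",
    ("HighBloodPressure", "Human", "Nystatin"),
    qualifiers=(("Age", "4-12y"), ("Effectiveness", 4)),
)

# Frozen dataclasses are hashable, so propositions can be stored in a set.
store = {causation, treatment}
```

Keeping propositions immutable and hashable makes it cheap to test whether a newly extracted proposition is already known.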
Answering questions
Question
How can high blood pressure be prevented?
Answer
Lose weight?
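The inference behind this answer can be sketched as a lookup: to prevent a condition, look for its stored causes. This is a toy illustration only; the tuple layout and the function name prevention_candidates are assumptions, and the argument order (condition, cause) copies the Causation example on the earlier slide.

```python
# Each proposition is (relation, condition, other argument); hypothetical layout.
propositions = [
    ("Causation", "HighBloodPressure", "Obesity"),
    ("Treatment", "HighBloodPressure", "Nystatin"),
]


def prevention_candidates(condition, store):
    """Return the stored causes of `condition`.

    Avoiding a cause is only a candidate prevention; a real system
    would need further knowledge to turn it into advice.
    """
    return [
        cause
        for relation, effect, cause in store
        if relation == "Causation" and effect == condition
    ]


print(prevention_candidates("HighBloodPressure", propositions))
```

The answer is only as good as the extracted propositions, which is the point of the example on these slides.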
Information extraction
Text
Kids begin grazing independently from their mothers at
three months
Formal representation
Separation (Mother, Child, {Goat, [Age, 3m]})
Automatic information extraction
• Find suitable documents or images
Highly structured documents (such as dictionaries) and documents
containing structured lists (such as a classification of life events) work well
• Recognize entities (concepts, named entities)
Find the unique identifier for each (from some standard scheme)
• Noun phrase and verb phrase identification
• Word sense disambiguation, co-reference resolution
• Determine relationships, express propositions in formal representation
• Much of this requires syntactic and semantic parsing
Also recognition of relationships from typographical arrangement
• Recognition of propositions not expressed in a single sentence
• Deal with negation and other qualifications.
Certainty (as expressed in one source)
Automatic information extraction
• Add to proposition store
• If proposition already known, just add reference to source
• If proposition new, add proposition with its source
• Identify relationships between propositions (such as
contradictions)
• Certainty (from information across sources, considering
evidential strength of each source)
• Can label proposition as to general origin (language of source
document, cultural origin of source document, scholarly / scientific
school of source document)
• Knowledge in proposition store assists in IE from new
documents
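The add-or-reference rule and cross-source certainty can be sketched as follows. The class PropositionStore and its methods are hypothetical, and the noisy-OR formula for combining source strengths is one possible choice, not something the presentation prescribes.

```python
from collections import defaultdict


class PropositionStore:
    """Maps each proposition to the set of sources that assert it."""

    def __init__(self):
        self._sources = defaultdict(set)

    def add(self, proposition, source):
        """Record the proposition; return True if it was not yet known."""
        is_new = proposition not in self._sources
        self._sources[proposition].add(source)
        return is_new

    def sources(self, proposition):
        return sorted(self._sources.get(proposition, ()))

    def certainty(self, proposition, source_strength):
        """Combine evidential strengths (0..1) of all sources by noisy-OR.

        Assumption: sources are independent; unknown sources count as 0.
        """
        doubt = 1.0
        for source in self._sources.get(proposition, ()):
            doubt *= 1.0 - source_strength.get(source, 0.0)
        return 1.0 - doubt


store = PropositionStore()
p = ("Causation", "HighBloodPressure", "Obesity")
first = store.add(p, "doc1")    # new proposition: stored with its source
second = store.add(p, "doc2")   # already known: only the source is added
```

Labels for general origin (language, cultural origin, school of thought) could be kept per source in the same way.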
Computer-supported IE
• Automatic information extraction is hard,
need to supplement with human IE
• IE as part of document authoring or during publishing
Collaborative IE (crowdsourcing)
• Build systems that support the human task
Make human IE and semantic enrichment by authors feasible
• Person edits results of automatic IE
• Person enters free-form proposition, system converts to formal
representation, person checks
• Reconciliation of differences in results
• Computer-supported IE system should
learn from changes made by human editor
Corpus-based
information extraction
• Find associations in a corpus
• Data mining over text corpora or numeric databases
• Finding connections between non-overlapping literatures,
pioneered by Don Swanson
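A rough sketch of the idea behind connecting non-overlapping literatures: one literature links A to B, another links B to C, and the shared B terms suggest an A to C connection that neither literature states. The function name and the pair format are assumptions; the toy data echo Swanson's fish oil and Raynaud's syndrome case and are for illustration only.

```python
from collections import defaultdict


def hidden_connections(literature_1, literature_2):
    """literature_1: (a, b) pairs; literature_2: (b, c) pairs.

    Returns {(a, c): sorted list of bridging b terms}.
    """
    c_by_b = defaultdict(set)
    for b, c in literature_2:
        c_by_b[b].add(c)

    bridges = defaultdict(set)
    for a, b in literature_1:
        for c in c_by_b.get(b, ()):
            bridges[(a, c)].add(b)
    return {pair: sorted(terms) for pair, terms in bridges.items()}


lit_1 = [("fish oil", "blood viscosity"), ("fish oil", "platelet aggregation")]
lit_2 = [
    ("blood viscosity", "Raynaud's syndrome"),
    ("platelet aggregation", "Raynaud's syndrome"),
]
found = hidden_connections(lit_1, lit_2)
```

In practice the pairs would come from co-occurrence statistics or extracted propositions, and candidates would be ranked and checked by a person.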
Multilingual
information extraction
• Requires IE tools in multiple languages
• Creates proposition store from many sources
• Interesting experiment
Document exists in two languages
Apply IE to both versions and compare results
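The experiment could be scored with a simple set comparison, assuming both language versions are extracted into the same language-independent representation. The function name and the sample propositions are invented for illustration.

```python
def compare_extractions(props_a, props_b):
    """Compare propositions extracted from two language versions."""
    a, b = set(props_a), set(props_b)
    return {"agreed": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical results: the English version misreads "kids" as children.
from_english = {
    ("Separation", "Mother", "Child", "3m"),
    ("Grazing", "Kid", "independent"),
}
from_other_language = {
    ("Separation", "Mother", "Goat", "3m"),
    ("Grazing", "Kid", "independent"),
}
report = compare_extractions(from_english, from_other_language)
```

Disagreements point to places where one language is ambiguous and the other is not.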
IE for Ontology creation
• Some extracted propositions can be used as
elements of an ontology
Discussed later
Semantic enrichment
A semantically enriched document
Reis et al. (2008)
Impact of Environment and Social Gradient on Leptospira infection in Urban Slums
(doi:10.1371/journal.pntd.0000228).
Infectious disease studied:
Leptospirosis
Pathogen (causative agent of disease): Leptospira spirochete
Vector of disease pathogen:
Rat (Rattus norvegicus)
Pathogen host subjected to study:
Human (Homo sapiens)
Number of subject individuals in study: 3,171
...
Purpose of study:
Quantify risk factors for leptospirosis . . .
Principal finding 1:
Prevalence of Leptospira antibodies . . .
Principal finding 2:
Disease risk . . .open sewers . . .
(http://dx.doi.org/10.1371/journal.pntd.0000228.x002)
A semantically enriched document
Tag Trees of Individual Semantic Classes of Highlighted Terms
disease
infectious diseases
diarrheal disease
childhood diarrhea
dengue
leptospirosis
human leptospirosis
meningococcal disease
pulmonary hemorrhage syndrome
visceral leishmaniasis
Weil's disease
occupational disease
zoonotic disease
ID = Infectious Disease Ontology
GO = Gene Ontology term used in ID
ID:0000012 immunity
ID:0000017 mortality
ID:0000023 zoonotic
ID:0000025 pathogenicity
ID:0000034 endemic
ID:0000038 parasite
ID:0000056 host
ID:0000057 carrier
ID:0000063 vector
ID:0000064 pathogen
ID:0000066 infectious agent
ID:0000069 primary pathogen
ID:0000104 infection
ID = Infectious Disease Ontology GO = Gene Ontology
IDO:0000000 ! process
IDO:0000083 transmission
IDO:0000231 horizontal transmission (GO:0000031)
IDO:0000104 infection
IDO:0000084 pathogenesis
IDO:0000221 ! infectious disease progression
IDO:0000100 ! pathogen evasion of host immune response
IDO:0000111 antigenic variation
IDO:0000115 genetic diversification
IDO:0000226 pathogen life cycle (GO:0000026)
IDO:0000001 ! role
IDO:0000036 ! colonizer
IDO:0000038 parasite
IDO:0000048 symptom
IDO:0000056 host
IDO:0000057 carrier
IDO:0000059 reservoir
IDO:0000063 vector
IDO:0000064 pathogen
IDO:0000066 infectious agent
IDO:0000069 primary pathogen
IDO:0000200 mode of transmission (GO:0000000)
IDO:0000002 ! quality
IDO:0000215 ! quality of host population
Semantically enriched documents
• Semantic enrichment supports semantic retrieval
• Broad area of its own
• Many different forms
• Explicit document structure
• Concept and named entity tagging and identification
• Assigning additional concepts or named entities
• Assigning extracted propositions
• Closely linked with information extraction
• IE produces elements of semantic enrichment
Semantic enrichment
through document structure
• On a broad level, a document's semantics can be made
explicit simply by the internal document structure
• Requires a document template or frame for the type of
document
• Document Structure Ontology with templates / frames
for many types of documents, including learning objects.
Standards for digital objects
• Includes document formats such as MPEG or SCORM
Template for a research report
1 Background (could also be called Problem)
1.1 General problem area (often including a review of the literature)
1.2 Specific problem. Purpose of the study, question to be answered
2 Methods
2.1 Discussion of the methods used in the study
2.2 Description of the actual conduct of the study
3 Results
4 Conclusions
4.1 Summary of methods and results
4.2 Relationship to existing body of knowledge
4.3 Implications for decision making and/or further research
Concept and named entity
tagging and identification
• Includes abstract concepts and named entities such
as persons, organizations, places, dates, events, etc.
• Identified with reference to some standard scheme,
such as a Knowledge Organization System (KOS,
includes ontologies, thesauri, etc.) or NE registry.
Add identifier as part of the tag
• Can tag within text or list separately as metadata
(with pointer to the precise piece of the text)
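A small sketch of the second option, listing tags separately as metadata with pointers into the text. The mini KOS and its identifiers (prefix EX:) are made up for the example, and the exact-string matching stands in for real entity recognition.

```python
import re

# Hypothetical mini KOS: label -> identifier (not a real scheme).
KOS = {"leptospirosis": "EX:0001", "rats": "EX:0002"}


def tag_entities(text, kos):
    """Return standoff annotations: character offsets plus KOS identifier."""
    annotations = []
    for label, identifier in kos.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                {
                    "start": match.start(),
                    "end": match.end(),
                    "id": identifier,
                    "surface": match.group(0),
                }
            )
    return sorted(annotations, key=lambda a: a["start"])


text = "Leptospirosis is transmitted by rats."
annotations = tag_entities(text, KOS)
```

Because the annotations live outside the text, several parties can enrich the same document without editing it.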
Additional concepts
or named entities
• Concepts or named entities that are not designated
by a word or phrase in the text but implied by the
document as a whole or a passage in it
• Assigned through
• Statistical automatic classifier
• Rule-based inference
• Human editor (with ontology-based assistance)
• Each concept or NE should be linked to smallest text
passage that implies it (may be the whole document)
Assigning extracted propositions
• Allows for more precise retrieval
• Example: Precise retrieval of documents on causation is
notoriously difficult
Does A cause B?
What are the effects of A?
What causes B?
• If propositions of the form A causes B are assigned to the
document in semantic enrichment, such searches are possible
• Propositions can be transferred to a larger repository (see IE)
or be available only through the enriched Web document;
they can still be found and used by Semantic Web agents
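The three causation questions above become simple filters once propositions of the form A causes B are attached to documents. This sketch assumes a (relation, cause, effect) tuple layout and placeholder concept names; none of it is an existing API.

```python
# Hypothetical enrichment: document id -> assigned propositions.
docs = {
    "doc1": [("causes", "A", "B")],
    "doc2": [("causes", "A", "C"), ("treats", "D", "B")],
    "doc3": [("causes", "E", "B")],
}


def find_causation(docs, cause=None, effect=None):
    """Find documents asserting causation; None acts as a wildcard."""
    hits = []
    for doc_id, propositions in docs.items():
        for relation, a, b in propositions:
            if relation == "causes" and cause in (None, a) and effect in (None, b):
                hits.append((doc_id, a, b))
    return hits


does_a_cause_b = find_causation(docs, cause="A", effect="B")
effects_of_a = find_causation(docs, cause="A")
causes_of_b = find_causation(docs, effect="B")
```

A word-based search for "A" and "B" would also return doc2's treatment statement; the proposition-based search does not.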
Making semantic enrichment
available
• Documents are enriched from many sources
The same document may receive multiple enrichments
• Digital libraries and publishers should ensure that a
user looking at any copy of a document sees all the
semantic enrichments for this document.
Dual representation
of document content
• Representation to use same content for two purposes
• for people (teach people)
• for computer processing (teach computer systems)
• How precise is the correspondence?
How complete is each representation?
• How easy is it to get from one to the other?
• Information extraction
• Text and image generation
Text generation in multiple languages –
one approach to translation
Integrated Digital Libraries:
Documents + Data + Tools
Elements
• Semantically enriched documents
• Proposition store (including propositions in any Web document)
• Data sets
• Tools for data analysis and reasoning
• All linked together, for example
• Drill down from a formally stated proposition to text and to
supporting data
• Link from text to formal propositions and related texts
• Link from data set to suitable data analysis tools
Created and maintained collaboratively
Example: Neurocommons http://sciencecommons.org/projects/data
Navigation
in concept structures
and document/data spaces
Concept structures
• Internally, concept structures are often
represented in RDF or OWL
• Externally, for the user,
they need to be shown in a meaningful representation
that reflects concept relationships
so the user can understand them and navigate them
• Can be trees shown in outline form with cross-references
or concept maps
• Challenge of producing these automatically
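As a very small illustration of the external view, the sketch below turns broader/narrower pairs (as they might be read from an RDF store) into an outline with cross-references. The function name, the pair format, and the sample concepts are assumptions, and the code expects a tree without cycles.

```python
from collections import defaultdict


def outline(broader_narrower_pairs, root, related=None):
    """Render a concept tree in outline form; `related` adds cross-references."""
    children = defaultdict(list)
    for broader, narrower in broader_narrower_pairs:
        children[broader].append(narrower)

    lines = []

    def walk(concept, depth):
        line = ". " * depth + concept
        if related and concept in related:
            line += " (see also: " + ", ".join(related[concept]) + ")"
        lines.append(line)
        for child in sorted(children[concept]):
            walk(child, depth + 1)

    walk(root, 0)
    return lines


pairs = [
    ("Transport", "Water transport"),
    ("Water transport", "Ocean transport"),
    ("Water transport", "Inland water transport"),
]
rendered = outline(pairs, "Transport", related={"Ocean transport": ["Ports"]})
print("\n".join(rendered))
```

The hard part named on the slide, choosing a meaningful arrangement automatically, is not addressed here; alphabetical order is only a placeholder.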
Concept structure with data
Document/data spaces
• Documents and document passages are related in
many ways that can be used for navigation and
presentation
• Challenge 1. Identify passages in multiple documents
and arrange them according to relationships that
allow the user to see the whole picture and navigate
passages in a meaningful sequence
• Challenge 2. Arrange passages to fit into the
structure of an argument
Multi-level topical structure
Information arranged
by role in argument
Topical relevance typology
Function-based
. Rhetorical structure
. . Matching topic
. . Evidence (Indirect)
. . Context
. . Comparison
. . Evaluation
. . Method / Solution
. . Purpose / Goal
. Argument structure
. . Grounds
. . Warrants
. . Claim
Reasoning-based
. Generic inference
. Comparison-based
. Induction / rule-based
. Causal-based
. Transitivity-based
Semantic-based (Green & Bean, 1995)
. Taxonomy
. Partonomy
. Frame-based, etc.
RST+ Functional Role
Matching topic (Direct)
. Manifestation
. Image content
. Image theme
Evidence (Indirect)
Context
. Scope
. Framework
. Environmental setting
. Social background
. Time & sequence
. Assumption / expectation
. Biographic information
Cause / Effect
. Cause
. Effect / Outcome
. Explanation (causal)
. Prediction
Comparison
. By similarity (analogy) / By difference (contrast)
. By factor that is different
Method / Solution
. Method / Approach
. Instrument
. Technique / Style
Condition
. Helping or hindering factor
. Unconditional
. Exceptional condition
Evaluation
. Significance
. Limitation
. Criterion / Standard
. Comparative evaluation
Purpose / Motivation
Functional role: Comparison
Comparison
. By similarity vs. By difference (Contrast)
. . By similarity
. . . Analogy & metaphor
. . By difference (Contrast)
. By factor that is different
. . Different external factor
. . . Different time
. . . Different place
. . Different participant
. . . Different actor
. . . Different subject acted upon
. . Different act or experience
. . . Different act
. . . Different experience
Support for learning,
sensemaking, tasks
Support for learning
• Structuring learning objects from small reusable
elements
• Indexing learning objects so they can be
• matched with individual learners
• arranged in a meaningful didactic sequence
• Automatic composition of learning objects
customized for individual learner
• Support learner control where appropriate
Support for learning
• Requires specialized document structure ontology
• Requires ontologies for
• Learner characteristics
• Learning objectives
• Learning object characteristics that can be used
for matching
• Types of relationships between learning objects
Examples: Prerequisite, elaboration
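Prerequisite relationships between learning objects can drive automatic sequencing. The sketch below uses a topological sort from Python's standard library (graphlib, Python 3.9 or later); the learning object names and the already_mastered parameter are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical learning objects: object -> set of prerequisite objects.
prerequisites = {
    "fractions": {"division"},
    "division": {"multiplication"},
    "multiplication": {"addition"},
    "addition": set(),
}


def didactic_sequence(prerequisites, already_mastered=()):
    """Order learning objects so every prerequisite comes first,
    skipping objects the individual learner has already mastered."""
    order = TopologicalSorter(prerequisites).static_order()
    return [obj for obj in order if obj not in already_mastered]


full_sequence = didactic_sequence(prerequisites)
for_this_learner = didactic_sequence(prerequisites, already_mastered={"addition"})
```

Other relationship types, such as elaboration, would need their own handling; a topological sort covers only hard ordering constraints.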
Support for learning
• Requires domain ontologies adapted for learning
and instruction
• Show meaningful structures for assimilation by the
learner
• Support arrangement and sequencing of material
to be learned
• Tools for ontology construction by the learner,
for example, concept maps
Active learning, building own structures,
constructivist approach
Sense-making
Sense-making is
the process of creating an understanding of a problem or task
so that further actions may be taken in an informed manner
• Sense-making is a pre-requisite for many other tasks such
as decision making and problem solving;
• Sense-making involves making clear the interrelated
concepts and their relationships in a problem or task space.
Sense-making scenario 1
Intelligence task T1: al-Bashir
The US wants to take action towards a resolution of the Darfur conflict. Al-Bashir,
the Sudanese president, is one of the key players in the area who is believed to have
significant responsibility for continuous conflicts in the region. The administration
needs to know as much as possible about al-Bashir in order to better negotiate with
the involved parties and strategize its efforts. Your task is to produce a report that
identifies information to assess the influence of al-Bashir and makes recommendations
for policy decisions and diplomatic actions.
Requested information includes:
• key figures, organizations, and countries who have been associated with al-Bashir;
• his rise to power; and
• groups who have resisted him and the level of success in their resistance.
Could draw a concept map based on multiple sources (map is for illustration)
Support for sensemaking
• Sensemaker needs structures
• Inputs to structure-building from documents that give
explicit structures, find such documents
• Sensemaker looks for data (often information
extraction from text) and fits data into structure.
If data do not fit, revises structure
• Some of this process could be automated as
discussed earlier
Support for tasks
• Have system derive solution
• Support user in deriving solution
• Support for sense-making
• Arrange search results by how they relate to the task
• Needs ontologies related to task, for example
• Ontology of types of tasks/problems
• Ontology of tasks/problems and their subtasks/problems
• Knowledge base of tasks/problems and solutions
(with drilling down to documents)
Schema and ontology
creation and mapping
KOS/ontologies for SW and DL
• Semantic Web is bringing ontology and classification back to
retrieval
• Ontologies created for the Semantic Web are often more exact
than library classifications and thesauri – added value for digital
libraries
• There is also the issue of creating universal identifiers for many
types of named entities (e.g. OKKAM)
Library cataloging rules incorporate much thinking about the
form of personal and corporate names
• Often inextricably linked with storing propositions about
these entities (needed for identification)
Automatic input to
ontology generation
• From text
• Extract ontological relations such as isa and partOf
Can be done one document, even passage, at a time
• Statistical association, machine learning, data mining
Requires a corpus
• From search logs
• Identifying patterns in series of successive queries
• Statistical association, machine learning, data mining
Requires a corpus
• Digital libraries supply corpora of texts and queries
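Extracting isa relations one sentence at a time is often done with lexical patterns. The sketch below implements a single "X such as Y, Z and W" pattern; it is a deliberately crude stand-in for real parsing, takes only the single word before "such as" as the broader term, and the function name is an assumption.

```python
import re

# Split the list after "such as" on commas and on "and" / "or".
LIST_SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_isa(sentence):
    """Return (narrower, "isa", broader) triples from one sentence."""
    match = re.search(r"(\w+) such as ([^.;]+)", sentence)
    if not match:
        return []
    broader = match.group(1)
    items = LIST_SEPARATOR.split(match.group(2))
    return [(item.strip(), "isa", broader) for item in items if item.strip()]


triples = extract_isa("The vector may be animals such as rats, dogs, and cattle.")
```

Candidates produced this way would still need identification against a standard scheme and review by a human editor.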
Reuse knowledge in existing
KOS
• Existing Knowledge Organization Systems (KOS),
such as ontologies, library classifications, thesauri,
dictionaries contain much intellectual capital that can
be reused.
Need to find these sources
Need tools to exploit this knowledge
Componential analysis
for deriving KOS structure
• Also known as facet analysis
• Expressing concepts as description logic formulas
using primitive concepts is a local operation.
The results can be used for deriving global structures
• Semantic components can often be discovered by
linguistic analysis of concept labels and considering
the structure of a scheme like Dewey Decimal
Classification
Human ontology editing
• High-quality ontologies and other KOS need human
editing
• The Semantic Web community provides ontology
editors, but the standards used do not accommodate
all information needed for the functions described
above
• Need more comprehensive tools
• Must support distributed collaborative editing
Granularity of ontologies
• For many retrieval tasks, shades of meaning can be
ignored
• Sometimes capturing shades of meaning is
important, particularly in a multilingual environment
• Text generation needs knowledge about subtleties of
meaning and usage
KOS/ontology mapping
• Very important for both Digital Libraries and Semantic
Web
• Includes translation between natural languages
• The discussion on methods and tools for ontology
creation applies
• Aligned corpora are useful
• Componential analysis can be used for KOS mapping
Mapping through a Hub
Dewey | Hub | LCSH | German
387 Water, air, space transportation | Water transport | Shipping |
386 Inland waterway & ferry transportation | Inland water transport | Inland water transport |
387.5 Ocean transportation | Ocean transport | Merchant marine |
 | Traffic station ⊓ Water transport | Harbors | Hafen
386.8 Inland waterway tr. > Ports | Traffic station ⊓ Inland water tr. | |
387.1 Ports | Traffic station ⊓ Ocean transport | |
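Mapping through a hub can be sketched by expressing each class as a conjunction (⊓) of hub concepts and comparing the conjunctions. The data below follow one reading of the slide layout (in particular, placing Harbors with Traffic station ⊓ Water transport is an assumption), and all function names are hypothetical.

```python
# Hub hierarchy: concept -> broader concept (assumed from the labels).
BROADER = {
    "Ocean transport": "Water transport",
    "Inland water transport": "Water transport",
}

dewey_to_hub = {
    "387.5 Ocean transportation": frozenset({"Ocean transport"}),
    "387.1 Ports": frozenset({"Traffic station", "Ocean transport"}),
}
lcsh_to_hub = {
    "Shipping": frozenset({"Water transport"}),
    "Merchant marine": frozenset({"Ocean transport"}),
    "Harbors": frozenset({"Traffic station", "Water transport"}),
}


def ancestors_or_self(concept):
    chain = [concept]
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain


def subsumed_by(specific, general):
    """True if every component of `general` is matched by an equal or
    narrower component of `specific`."""
    return all(
        any(g in ancestors_or_self(s) for s in specific) for g in general
    )


def map_via_hub(label, source_to_hub, target_to_hub):
    """Return ("exact", targets) or ("broader", targets, most specific first)."""
    expr = source_to_hub[label]
    exact = [t for t, e in target_to_hub.items() if e == expr]
    if exact:
        return ("exact", exact)
    broader = [t for t, e in target_to_hub.items() if subsumed_by(expr, e)]
    broader.sort(key=lambda t: -len(target_to_hub[t]))
    return ("broader", broader)


ocean = map_via_hub("387.5 Ocean transportation", dewey_to_hub, lcsh_to_hub)
ports = map_via_hub("387.1 Ports", dewey_to_hub, lcsh_to_hub)
```

With a hub, each scheme is mapped once to the hub rather than pairwise to every other scheme, and inexact matches fall out of the component structure.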
Special case: Schema mapping
• Database schemas
• Document structures
Example: Learning objects structured according to
different structures
• Map schema, then convert content from one schema
to another
What information is lost?
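A minimal sketch of converting content after the schemas have been mapped, reporting which fields have no counterpart and are therefore lost. The field names and the function convert are invented for illustration and do not come from any learning object standard.

```python
def convert(record, field_map):
    """Convert a record to the target schema.

    Returns (converted record, sorted list of source fields that were lost).
    """
    converted = {
        field_map[field]: value
        for field, value in record.items()
        if field in field_map
    }
    lost = sorted(field for field in record if field not in field_map)
    return converted, lost


# Hypothetical learning object in schema A and a mapping to schema B.
learning_object = {
    "title": "Adding fractions",
    "objective": "Add fractions with unlike denominators",
    "prerequisite": "Division",
}
a_to_b = {"title": "name", "objective": "learningGoal"}

converted, lost = convert(learning_object, a_to_b)
```

Reporting the lost fields makes the cost of a conversion visible before content is migrated.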
Take-home message
Digital Libraries and the Semantic Web are
mutually dependent and supportive and
convergent
DL ↔ SW
Dagobert Soergel
College of Information Studies
University of Maryland
Department of Library and Information Studies
Graduate School of Education
University at Buffalo
dsoergel @ umd.edu
www.dsoergel.com