Dejing Dou's Colloquium Talk (Sept. 15)

Transcript
Dejing Dou
Computer and Information Science
University of Oregon, Eugene, Oregon
September 2010 @ Kent State University
Where is Eugene, Oregon?
Outline
 Introduction
 Ontology and the Semantic Web
 Biomedical Ontology Development
 Challenges for Data-driven Approaches
 The NEMO Project
 Mining ERP Ontologies (KDD’07)
 Modeling NEMO Ontology Databases (SSDBM’08,
JIIS’10)
 Mapping ERP Metrics (PAKDD’10)
 Ongoing Work
What is an Ontology?
A formal specification of a vocabulary of domain concepts and the relationships among them.
A Genealogy Ontology
[Diagram: a genealogy ontology with classes Individual, Gender, Event, Family, Male, Female, BirthEvent, MarriageEvent, DeathEvent, and DivorceEvent, connected by properties such as sex, birth, childIn, husband, wife, marriage, and divorce.]
 Classes: Individual, Male, Female, Family, MarriageEvent, …
 Properties: sex, husband, wife, birth, …
 Axioms: If there is a MarriageEvent, there will be a Family related through the husband and wife properties.
 Ontology languages: OWL, KIF, OBO, …
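As a minimal illustration (mine, not from the talk) of how such classes and properties can be written in OWL, here is a Python sketch using rdflib; the namespace IRI and the exact modeling choices are hypothetical.

```python
# Sketch: declaring a few genealogy classes and a property in OWL with
# rdflib. The example.org namespace is a hypothetical placeholder.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

GEN = Namespace("http://example.org/genealogy#")
g = Graph()

# Classes: Individual, its subclasses Male and Female, and Family.
for cls in (GEN.Individual, GEN.Male, GEN.Female, GEN.Family):
    g.add((cls, RDF.type, OWL.Class))
g.add((GEN.Male, RDFS.subClassOf, GEN.Individual))
g.add((GEN.Female, RDFS.subClassOf, GEN.Individual))

# Property: husband links a Family to a Male individual.
g.add((GEN.husband, RDF.type, OWL.ObjectProperty))
g.add((GEN.husband, RDFS.domain, GEN.Family))
g.add((GEN.husband, RDFS.range, GEN.Male))

print(g.serialize(format="turtle"))
```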
Current WWW
 The majority of data resources on the WWW are in human-readable formats only (e.g., HTML).
[Diagram: humans reading the WWW.]
The Semantic Web
 One major goal of the Semantic Web is that web-based agents can process and “understand” data [Berners-Lee et al. 2001].
 Ontologies formally describe the semantics of data, and web-based agents can take web documents (e.g., in RDF, OWL) as a set of assertions and draw inferences from them.
[Diagram: humans and web-based agents both consuming the Semantic Web.]
Biomedical Ontologies
 The Gene Ontology (GO): standardizes the formal representation of gene and gene product attributes across all species and gene databases (e.g., zebrafish, mouse, fruit fly)
 Classes: cellular component, molecular function, biological process, … Properties: is_a, part_of
 The Unified Medical Language System (UMLS): a comprehensive thesaurus and ontology of biomedical concepts
 The National Center for Biomedical Ontology (NCBO) at Stanford University
 >200 ontologies (hundreds to thousands of concepts each), with 4 million mappings
Biomedical Ontology Development
 Typically knowledge-driven: a top-down process
 Some basic steps and principles:
 Discussions among domain experts and ontology engineers
 Select basic (root) classes and properties (i.e., terms)
 Go deeper for sub-concepts and relationships
 Modularization may be considered if the ontology is expected to be large
 Add constraints (axioms)
 Add unique IDs (e.g., URLs) and textual definitions for terms
 Consistency checking
 Updating and evolution (e.g., GO is updated every 15 minutes)
Challenges
 Knowledge sharing does not automatically enable data sharing
 Annotation (like tags) helps search over text (e.g., papers), but is not good for experimental data (e.g., numerical values)
 Three main challenges for knowledge/data sharing:
 Heterogeneity: different labs use different analysis methods, spreadsheet attributes, and DB schemas
 Reusability: knowledge mined from different experimental data may not be consistent and sharable
 Scalability: experimental data grow much larger than the ontologies themselves, and ontology-based reasoning (e.g., ABox reasoning) over large data is a headache
Case Study: EEG Data
 Electroencephalogram (EEG) data
 Observing brain function through EEG
 Brain activity occurs in the cortex, and cortical activity generates the scalp EEG
 EEG data (dense-array, 256 channels) has high temporal resolution (1 ms) but poor spatial resolution (2D); MR imaging (fMRI, PET) has good spatial resolution (3D) but poor temporal resolution (~1.0 s)
ERP Data and Pattern Analysis
 Event-related potentials (ERPs) are created by averaging across segments of EEG data from different trials, time-locked (e.g., every 2 seconds) to stimulus events or responses.
[Figure: (A) 128-channel ERPs to visual word and nonword stimuli. (B) Time course of the P100 pattern extracted by PCA. (C) Scalp topography (spatial distribution) of the P100 pattern.]
 Some existing tools (e.g., Net Station, EEGLAB, APECS, the Dien PCA Toolbox) can process ERP data and perform pattern analysis.
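As a toy illustration of the averaging step described above (my sketch, not a NEMO tool; the array shapes are hypothetical):

```python
# Toy sketch: an ERP is the average of stimulus-locked EEG epochs.
import numpy as np

# Hypothetical data: 50 trials x 256 channels x 375 time samples.
epochs = np.random.randn(50, 256, 375)  # stand-in for segmented EEG
erp = epochs.mean(axis=0)               # average across trials
print(erp.shape)                        # (256, 375): channels x time
```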
NEMO: NeuroElectroMagnetic Ontologies
 Some challenges in ERP study
 Patterns can be difficult to identify and definitions vary across
research labs. Methods for ERP analysis differ across research
sites.
 It is hard to compare and share the results across experiments
and across labs.
 The NEMO (NeuroElectroMagnetic Ontologies) project addresses these challenges by developing ontologies to support ERP data and pattern representation, sharing, and meta-analysis. It has been funded by the NIH as an R01 project since 2009.
Architecture
[Architecture diagram of the NEMO system.]
Progress in Data Driven Approaches
 Mining ERP Ontologies (KDD’07) -- Reusability
 Modeling NEMO Ontology Databases (SSDBM’08,
JIIS’10) -- Scalability
 Mapping ERP Metrics (PAKDD’10) -- Heterogeneity
Ontology Mining
 Ontology mining is a process for learning an ontology,
including classes, class taxonomy, properties and axioms, from
data.
 Existing ontology mining approaches focus on text mining or
web mining (web content, usage, structure, user profiles).
 Clustering and association rule mining have been used for classes and
properties. [Li&Zhong @ TKDE 18(4), Maedche&Staab @ EKAW’00,
Reinberger et al @ ODBASE’03].
 The NetAffix Gene Ontology mining tool has been applied to microarray data [Cheng et al @ Bioinformatics 20 (9)].
 Our approach, which is novel, uses hierarchical clustering and classification to mine the class taxonomy, properties, and axioms of a first-generation ERP-specific ontology from spreadsheet data.
Knowledge Reuse in KDD
[Diagram: the KDD pipeline (Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation), annotated with a question mark: discovered patterns lack formal semantics.]
Our Framework (KDD’07)
A semi-automatic framework for mining ontologies
Four General Procedures
 Classes <= Clustering-based Classification
 Class Taxonomy <= Hierarchical Clustering
 Properties <= Classification
 Axioms <= Association Rule Mining and Classification
Experiments on ERP Data
 Preprocessing Data with Temporal PCA
 Mining ERP Classes with Clustering-based Classification
 Mining ERP Class Taxonomy with Hierarchical Clustering
 Mining Properties and Axioms (Rules) with Classification
 Discovering Axioms among Properties with Association Rule Mining
Input Raw ERP data
Subject
Condition
Channel#
Time1(µv)
Time2(µv)
Time3(µv)
Time4(µv)
Time5(µv)
Time6(µv)
S01
A
1
0.077
0.136
0.075
0.095
0.188
0.097
S01
A
2
0.891
1.780
0.895
0.805
1.612
0.813
S01
A
3
0.014
0.018
0.013
0.040
0.066
0.035
S01
A
4
0.657
1.309
0.657
0.789
1.571
0.785
S01
A
5
0.437
0.864
0.432
1.007
2.002
1.003
S01
B
1
0.303
0.603
0.303
0.128
0.250
0.123
S01
B
2
0.477
0.951
0.483
0.418
0.841
0.418
S01
B
3
0.538
0.073
0.038
0.029
0.043
0.022
S01
B
4
0.509
1.061
0.533
0.628
1.254
0.626
S01
B
5
1.497
1.024
0.510
0.218
0.434
0.219
S02
A
1
1.275
2.987
1.500
0.382
0.769
0.386
S02
A
2
0.666
2.555
1.281
0.326
0.648
0.329
S02
A
3
0.673
1.321
0.666
1.026
2.051
1.029
S02
A
4
0.284
1.341
0.678
1.966
3.914
1.966
S02
A
5
0.980
0.564
0.292
0.511
1.012
0.507
S02
B
1
0.367
1.960
0.978
1.741
3.486
1.739
S02
B
2
0.864
0.721
0.365
1.470
2.934
1.472
S02
B
3
0.568
1.729
0.866
1.342
2.680
1.337
S02
B
4
0.149
1.134
0.575
0.210
0.423
0.215
S02
B
5
0.042
0.287
0.151
0.433
0.860
0.433
Sampling rate: 250 Hz for 1500 ms (375 samples)
Experiments 1-2: 89 subjects and 6 experiment conditions
Experiment 3: 36 subjects and 4 experiment conditions
Data Preprocessing (1)
 Temporal PCA decomposition
[Illustration: PCA decomposes a complex waveform into component 1 + component 2.]
PCA extracts as many factors (components) as there are variables (i.e., the number of time samples). We retain the first 15 PCA factors, which account for most of the variance (>75%); the remaining factors are assumed to contain “noise”.
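A minimal sketch of this step (assuming scikit-learn; the data shapes are hypothetical stand-ins, not the project's actual pipeline):

```python
# Sketch of temporal PCA: each time sample is a variable and each
# subject/condition/channel waveform is an observation; keep 15 factors.
import numpy as np
from sklearn.decomposition import PCA

waveforms = np.random.randn(2000, 375)  # stand-in for real ERP waveforms
pca = PCA(n_components=15)
factor_scores = pca.fit_transform(waveforms)
# The retained factors should account for most of the variance (>75%).
print(pca.explained_variance_ratio_.sum())
```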
Data Preprocessing (2)
 Intensity, spatial, temporal, and functional metrics (attributes) for each factor
ERP Factors after PCA Decomposition

TI-max (µs)  IN-mean (ROI) (µV)  IN-mean (ROCC) (µV)  ...  SP-min (channel#)
128          4.2823              4.7245               …    24
96           1.2223              1.3955               …    62
164          -6.6589             -4.7608              …    59
220          -3.635              -2.0782              …    58
244          -0.81322            0.29263              …    65

For Experiment 1 data, number of factors = (474) (594)
For Experiment 2 data, number of factors = (588) (598)
For Experiment 3 data, number of factors = 708
Mining ERP Classes with Clustering (1)
 We use EM (Expectation-Maximization) clustering
 E.g., for Experiment 1, group 2 data:

Pattern     Cluster 0   Cluster 1   Cluster 2   Cluster 3
P100        0           76          0           2
N100        117         1           0           54
lateN1/N2   13          14          0           104
P300        0           61          110         42
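A minimal sketch of the EM clustering step (assuming scikit-learn's Gaussian mixture implementation; the metric matrix is a hypothetical stand-in):

```python
# Sketch: EM clustering of PCA-factor metric vectors with a Gaussian
# mixture model; labels can then be cross-tabulated against expert
# pattern labels (P100, N100, lateN1/N2, P300) as in the table above.
import numpy as np
from sklearn.mixture import GaussianMixture

factor_metrics = np.random.randn(594, 20)  # stand-in for factor metrics
gmm = GaussianMixture(n_components=4, random_state=0)
cluster_labels = gmm.fit_predict(factor_metrics)
```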
Mining ERP Classes with Clustering (2)
 We use OWL to represent ERP Classes
Mining ERP Class Taxonomy with Hierarchical Clustering
 We use EM clustering in both divisive and
agglomerative ways.
 E.g. for Experiment 3 data
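One way to realize divisive EM clustering (a sketch under assumptions, using scikit-learn; not the project's code) is to recursively split each cluster with a 2-component Gaussian mixture until the clusters become small:

```python
# Sketch of divisive (top-down) hierarchical clustering via repeated
# 2-component EM splits; each split yields two child clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

def divisive_em(data, min_size=20, depth=0, max_depth=3):
    """Recursively split `data` with 2-component GMMs; return a tree."""
    if len(data) < min_size or depth >= max_depth:
        return {"size": len(data)}  # leaf cluster
    labels = GaussianMixture(n_components=2, random_state=0).fit_predict(data)
    return {
        "size": len(data),
        "children": [divisive_em(data[labels == k], min_size, depth + 1, max_depth)
                     for k in (0, 1)],
    }

factor_metrics = np.random.randn(708, 20)  # stand-in for Experiment 3 factors
tree = divisive_em(factor_metrics)
```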
Mining ERP Class Taxonomy with Hierarchical Clustering
 We use OWL to represent class taxonomy
28
Mining Properties and Axioms with Clustering-based Classification (1)
 We use decision tree learning (C4.5) for classification, with the training data labeled by the clustering results.
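A minimal sketch of this step (assuming scikit-learn, whose trees are CART-based rather than true C4.5; shapes and feature names are hypothetical):

```python
# Sketch: train a decision tree on factor metrics, using the EM cluster
# assignments as class labels, then read off human-checkable rules.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier, export_text

factor_metrics = np.random.randn(594, 6)   # stand-in metric columns
labels = GaussianMixture(n_components=4, random_state=0).fit_predict(factor_metrics)

tree = DecisionTreeClassifier(max_depth=4).fit(factor_metrics, labels)
print(export_text(tree, feature_names=[f"metric_{i}" for i in range(6)]))
```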
Mining Properties and Axioms with Clustering-based Classification (2)
 We use OWL to represent datatype properties, which are based on the attributes with high information gain (e.g., the top 6).
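A sketch of picking the top-6 attributes (assuming scikit-learn's mutual information estimator as a stand-in for the information-gain ranking; all names and shapes are hypothetical):

```python
# Sketch: rank metric columns by estimated mutual information with the
# cluster labels and keep the top 6 as candidate datatype properties.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

X = np.random.randn(594, 20)             # stand-in metrics
y = np.random.randint(0, 4, size=594)    # stand-in cluster labels
gain = mutual_info_classif(X, y, random_state=0)
top6 = np.argsort(gain)[::-1][:6]        # indices of top-6 attributes
print(top6)
```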
Mining Properties and Axioms with Clustering-based Classification (3)
 We use SWRL to represent axioms; in FOL, they take a form like the example below.
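The actual mined axioms appear on the slide image; purely as an illustration of the shape such rules take (the attribute names and thresholds here are hypothetical, not real NEMO axioms):

```latex
% Hypothetical shape of a mined classification rule in FOL:
\forall x\, \bigl( \mathit{Factor}(x) \land \mathit{INmean}(x) > 2.0
    \land \mathit{TImax}(x) \leq 150 \rightarrow \mathit{P100}(x) \bigr)
```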
Discovering Axioms among Properties with Association Rule Mining
 We use the Apriori algorithm to find association rules among properties. The split points are determined by the classification rules, and the resulting rules can be rendered in FOL.
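A minimal sketch of the Apriori step (assuming the mlxtend library; the discretized attributes are hypothetical, not the project's actual items):

```python
# Sketch: discretize metrics at the split points from the decision tree,
# one-hot encode them, and run Apriori + rule generation with mlxtend.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical boolean items, e.g. "IN-mean > 2.0" per factor.
items = pd.DataFrame({
    "IN_mean_high": [True, True, False, True, False, True],
    "TI_max_early": [True, True, False, True, False, False],
    "SP_frontal":   [True, False, False, True, True, False],
})
frequent = apriori(items, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```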
Rule Optimization
 Idea: (A → B) ∧ (A ∧ B → C) ⇒ (A → C)
A Partial View of the Mined ERP Data Ontology
 Our first-generation ERP ontology consists of 16 classes, 57 properties, and 23 axioms.
Ontology-based Data Modeling (SSDBM’08, JIIS’10)
 In general, ontologies can be treated as a kind of conceptual model. Since the data (e.g., PCA factors) can be large, instead of building a knowledge base to store the data, we propose to use relational databases.
 We designed database schemas based on our ERP ontologies, which include temporal, spatial, and functional concepts.
Ontology Databases
[Diagram: mapping ontology constructs to relational ones: classes become relations, datatypes become SQL datatypes, axioms become keys, constraints, triggers, and views, and objects/facts become tuples. Now we have bridged these.]
[Charts: Lehigh University Benchmark (10 universities, 20 departments). Load time for 1.5 million facts and query performance, plotted on a logarithmic time scale.]
Ontology-based Data Modeling
 For example, for the important subsumption axioms (e.g., subClassOf) in the current ERP ontologies, we use SQL triggers and foreign keys to represent them.
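A minimal sketch of the idea (my illustration using SQLite from Python; the table and column names are hypothetical, not the NEMO schema): a trigger copies every P100 fact up into its superclass table, materializing the subsumption axiom P100 ⊑ Pattern.

```python
# Sketch: represent the subclass axiom "P100 subClassOf Pattern" with a
# foreign key plus a trigger that propagates inserts to the superclass.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pattern (id INTEGER PRIMARY KEY);
CREATE TABLE p100 (
    id INTEGER PRIMARY KEY REFERENCES pattern(id)
);
CREATE TRIGGER p100_isa_pattern
AFTER INSERT ON p100
BEGIN
    INSERT OR IGNORE INTO pattern (id) VALUES (NEW.id);
END;
""")
db.execute("INSERT INTO p100 (id) VALUES (1)")
print(db.execute("SELECT id FROM pattern").fetchall())  # [(1,)]
```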
Ontology-based Data Modeling
The ER diagram for the ERP ontology database shows tables (boxes) and foreign-key constraints (arrows). The concepts pattern, factor, and channel are the most densely connected.
NEMO Data Mapping (PAKDD’10)
 Motivation
 Lack of meta-analysis across experiments, because different labs may use different metrics
 Goal of the study
 Mapping alternative sets of ERP spatial and temporal metrics
Problem Definition
 Alternative sets of ERP metrics
 Challenges:
 Semi-structured data
 Uninformative column headers (string-similarity matching does not work)
 Numerical values
 Processing steps: grouping and reordering, then sequence post-processing
Cross-spatial Join
 Process all point-sequence curves from Metric Set 1 and Metric Set 2
 Calculate the Euclidean distance between sequences in the Cartesian product set (the cross-spatial join)
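A minimal sketch of that pairwise computation (assuming SciPy; the curves are hypothetical stand-ins for the per-metric point sequences):

```python
# Sketch of the cross-spatial join: all-pairs Euclidean distances
# between point-sequence curves from two alternative metric sets.
import numpy as np
from scipy.spatial.distance import cdist

set1 = np.random.randn(9, 50)    # 9 curves from metric set 1, length 50
set2 = np.random.randn(13, 50)   # 13 curves from metric set 2
dist = cdist(set1, set2, metric="euclidean")   # shape (9, 13)
best_match = dist.argmin(axis=1) # nearest set-2 curve for each set-1 curve
```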
Assumptions and Heuristics
 The two datasets contain the same or similar ERP patterns if they come from the same paradigms (e.g., a visual/auditory oddball task: watching or listening for uncommon or fake words among common words)
 The gold-standard mapping falls along the diagonal cells; off-diagonal cells are wrong mappings
 Example result: precision = 9/13
Experiment
 Design of experiment data
 2 simulated “subject groups” (samples)
 SG1 = sample 1
 SG2 = sample 2
 2 data decompositions
 tPCA = temporal PCA decomposition
 sICA = spatial ICA (Independent Component Analysis)
decomposition
 2 sets of alternative metrics
 m1 = metric set 1
 m2 = metric set 2
Experiment Result
Overall Precision: 84.6%
NEMO-Related Ongoing Work
 Application of our framework to other domains
 microRNA, medical informatics, gene databases
 Mapping discovery and integration across ontologies related to different modalities (e.g., EEG vs. fMRI)
Joint EEG-fMRI Data Mapping
Joint work with:
Gwen Frishkoff, Jiawei Rong, Robert Frank, Paea LePendu, Haishan Liu, Allen Malony, and Don Tucker
Thanks for your attention!
Any questions?