Dejing Dou
Computer and Information Science, University of Oregon, Eugene, Oregon
September 2010, @ Kent State University

Where is Eugene, Oregon?

Outline
- Introduction: Ontology and the Semantic Web
- Biomedical Ontology Development
- Challenges for Data-Driven Approaches
- The NEMO Project:
  - Mining ERP Ontologies (KDD'07)
  - Modeling NEMO Ontology Databases (SSDBM'08, JIIS'10)
  - Mapping ERP Metrics (PAKDD'10)
- Ongoing Work

What is Ontology?
A formal specification of a vocabulary of domain concepts and the relationships relating them.

A Genealogy Ontology
[Diagram: classes Individual, Male, Female, Family, and Event (BirthEvent, MarriageEvent, DeathEvent, DivorceEvent), connected by properties such as sex, gender, birth, childIn, husband, wife, marriage, and divorce.]
- Classes: Individual, Male, Female, Family, MarriageEvent, ...
- Properties: sex, husband, wife, birth, ...
- Axioms: if there is a MarriageEvent, there will be a Family related to it through the husband and wife properties.
- Ontology languages: OWL, KIF, OBO, ...
(A code sketch of this example appears at the end of this section.)

Current WWW
The majority of data resources on the WWW are in human-readable formats only (e.g., HTML).

The Semantic Web
One major goal of the Semantic Web is that web-based agents can process and "understand" data [Berners-Lee et al. 2001]. Ontologies formally describe the semantics of data, and web-based agents can take web documents (e.g., in RDF, OWL) as sets of assertions and draw inferences from them.

Biomedical Ontologies
- The Gene Ontology (GO): standardizes the formal representation of gene and gene product attributes across all species and gene databases (e.g., zebrafish, mouse, fruit fly).
  Classes: cellular component, molecular function, biological process, ...
  Properties: is_a, part_of
- The Unified Medical Language System (UMLS): a comprehensive thesaurus and ontology of biomedical concepts.
- The National Center for Biomedical Ontology (NCBO) at Stanford University: more than 200 ontologies (hundreds to thousands of concepts each) and about 4 million mappings.

Biomedical Ontology Development
Typically knowledge driven: a top-down process. Some basic steps and principles:
- Discussions among domain experts and ontology engineers
- Select basic (root) classes and properties (i.e., terms)
- Go deeper for sub-concepts and relationships; consider modularization if the ontology is expected to be large
- Add constraints (axioms)
- Add unique IDs (e.g., URLs) and textual definitions for terms
- Consistency checking
- Updating and evolution (e.g., GO is updated every 15 minutes)

Challenges: Knowledge Sharing Does Not Automatically Help Data Sharing
Annotation (like tagging) helps search in text (e.g., papers), but it is not good for experimental data (e.g., numerical values). Three main challenges for knowledge/data sharing:
- Heterogeneity: different labs use different analysis methods, spreadsheet attributes, and DB schemas.
- Reusability: knowledge mined from different experimental data may not be consistent and sharable.
- Scalability: experimental data grow much larger than the ontologies themselves, and ontology-based reasoning (e.g., ABox reasoning) over large data is a headache.
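Returning to the genealogy example above: the following is a minimal sketch (mine, not from the slides) that encodes a few of its classes, the subclass taxonomy, and the husband/wife properties in OWL using Python's rdflib. The namespace URI is an assumption made for illustration.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

GEN = Namespace("http://example.org/genealogy#")  # hypothetical namespace
g = Graph()
g.bind("gen", GEN)

# Classes: Individual with Male/Female subclasses, Family, and event types.
for cls in ("Individual", "Male", "Female", "Family", "Event",
            "BirthEvent", "MarriageEvent", "DeathEvent", "DivorceEvent"):
    g.add((GEN[cls], RDF.type, OWL.Class))
g.add((GEN.Male, RDFS.subClassOf, GEN.Individual))
g.add((GEN.Female, RDFS.subClassOf, GEN.Individual))
for ev in ("BirthEvent", "MarriageEvent", "DeathEvent", "DivorceEvent"):
    g.add((GEN[ev], RDFS.subClassOf, GEN.Event))

# Object properties relating a Family to its husband and wife.
for prop, rng in (("husband", GEN.Male), ("wife", GEN.Female)):
    g.add((GEN[prop], RDF.type, OWL.ObjectProperty))
    g.add((GEN[prop], RDFS.domain, GEN.Family))
    g.add((GEN[prop], RDFS.range, rng))

print(g.serialize(format="turtle"))
```

The full MarriageEvent axiom would be stated in a rule language such as SWRL on top of this vocabulary; plain OWL triples only carry the class and property declarations.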
Case Study: EEG Data
Observing brain functions through electroencephalogram (EEG) data:
- Brain activity occurs in the cortex, and cortex activity generates the scalp EEG.
- Dense-array EEG data (256 channels) has high temporal resolution (1 msec) but poor spatial resolution (2D); MR imaging (fMRI, PET) has good spatial resolution (3D) but poor temporal resolution (~1.0 sec).

ERP Data and Pattern Analysis
Event-related potentials (ERPs) are created by averaging across segments of EEG data from different trials, time-locked (e.g., every 2 seconds) to stimulus events or responses. (A sketch of this averaging step appears at the end of this section.)
[Figure: (A) 128-channel ERPs to visual word and nonword stimuli. (B) Time course of the P100 pattern extracted by PCA. (C) Scalp topography (spatial distribution) of the P100 pattern.]
Some existing tools (e.g., Net Station, EEGLAB, APECS, the Dien PCA Toolbox) can process ERP data and perform pattern analysis.

NEMO: NeuroElectroMagnetic Ontologies
Some challenges in ERP studies:
- Patterns can be difficult to identify, and definitions vary across research labs.
- Methods for ERP analysis differ across research sites.
- It is hard to compare and share results across experiments and across labs.
The NEMO (NeuroElectroMagnetic Ontologies) project addresses these challenges by developing ontologies to support ERP data and pattern representation, sharing, and meta-analysis. It has been funded by the NIH as an R01 project since 2009.

Architecture
[Architecture diagram.]

Progress in Data-Driven Approaches
- Mining ERP Ontologies (KDD'07) -- Reusability
- Modeling NEMO Ontology Databases (SSDBM'08, JIIS'10) -- Scalability
- Mapping ERP Metrics (PAKDD'10) -- Heterogeneity

Ontology Mining
Ontology mining is a process for learning an ontology, including its classes, class taxonomy, properties, and axioms, from data. Existing ontology mining approaches focus on text mining or web mining (web content, usage, structure, user profiles); clustering and association rule mining have been used for classes and properties [Li & Zhong @ TKDE 18(4); Maedche & Staab @ EKAW'00; Reinberger et al. @ ODBASE'03]. The NetAffx Gene Ontology mining tool has been applied to microarray data [Cheng et al. @ Bioinformatics 20(9)]. Our approach is novel in using hierarchical clustering and classification to mine the class taxonomy, properties, and axioms of a first-generation ERP data-specific ontology from spreadsheets.
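Before moving on: here is a minimal sketch (not from the slides) of the ERP averaging described above, in which stimulus-locked EEG segments are extracted and averaged per channel. The array shapes, event spacing, and function name are illustrative assumptions.

```python
import numpy as np

def average_erp(eeg, event_samples, pre=50, post=325):
    """Average stimulus-locked EEG segments into an ERP.

    eeg           -- array of shape (n_channels, n_samples), e.g. 256 channels
    event_samples -- sample indices at which the stimulus occurred
    pre, post     -- samples kept before/after each event
                     (375 total at 250 Hz ~ 1500 ms, matching the deck)
    """
    epochs = [eeg[:, s - pre:s + post] for s in event_samples
              if s - pre >= 0 and s + post <= eeg.shape[1]]
    # Stack trials and average: (n_trials, n_channels, n_times) -> ERP
    return np.stack(epochs).mean(axis=0)

# Toy usage: random "EEG" with events every 2 seconds (500 samples at 250 Hz)
rng = np.random.default_rng(0)
eeg = rng.normal(size=(256, 250 * 60))
erp = average_erp(eeg, event_samples=range(500, 14500, 500))
print(erp.shape)  # (256, 375)
```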
Knowledge Reuse in KDD?
The classic KDD pipeline lacks formal semantics.
[Figure: Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation.]

Our Framework (KDD'07)
A semi-automatic framework for mining ontologies.
[Framework diagram.]

Four General Procedures
- Classes <= clustering-based classification
- Class taxonomy <= hierarchical clustering
- Properties <= classification
- Axioms <= association rule mining and classification

Experiments on ERP Data
- Preprocessing the data with temporal PCA
- Mining ERP classes with clustering-based classification
- Mining the ERP class taxonomy with hierarchical clustering
- Mining properties and axioms (rules) with classification
- Discovering axioms among properties with association rule mining

Input: Raw ERP Data

Subject  Condition  Channel#  Time1(µV)  Time2(µV)  Time3(µV)  Time4(µV)  Time5(µV)  Time6(µV)
S01      A          1         0.077      0.136      0.075      0.095      0.188      0.097
S01      A          2         0.891      1.780      0.895      0.805      1.612      0.813
S01      A          3         0.014      0.018      0.013      0.040      0.066      0.035
S01      A          4         0.657      1.309      0.657      0.789      1.571      0.785
S01      A          5         0.437      0.864      0.432      1.007      2.002      1.003
S01      B          1         0.303      0.603      0.303      0.128      0.250      0.123
S01      B          2         0.477      0.951      0.483      0.418      0.841      0.418
S01      B          3         0.538      0.073      0.038      0.029      0.043      0.022
S01      B          4         0.509      1.061      0.533      0.628      1.254      0.626
S01      B          5         1.497      1.024      0.510      0.218      0.434      0.219
S02      A          1         1.275      2.987      1.500      0.382      0.769      0.386
S02      A          2         0.666      2.555      1.281      0.326      0.648      0.329
S02      A          3         0.673      1.321      0.666      1.026      2.051      1.029
S02      A          4         0.284      1.341      0.678      1.966      3.914      1.966
S02      A          5         0.980      0.564      0.292      0.511      1.012      0.507
S02      B          1         0.367      1.960      0.978      1.741      3.486      1.739
S02      B          2         0.864      0.721      0.365      1.470      2.934      1.472
S02      B          3         0.568      1.729      0.866      1.342      2.680      1.337
S02      B          4         0.149      1.134      0.575      0.210      0.423      0.215
S02      B          5         0.042      0.287      0.151      0.433      0.860      0.433

Sampling rate: 250 Hz for 1500 ms (375 samples). Experiments 1-2: 89 subjects and 6 experiment conditions. Experiment 3: 36 subjects and 4 experiment conditions.

Data Preprocessing (1): Temporal PCA Decomposition
[Figure: complex waveform = PCA component 1 + component 2.]
PCA extracts as many factors (components) as there are variables (i.e., the number of time samples). We retain the first 15 PCA factors, which account for most of the variance (> 75%); the remaining factors are assumed to contain "noise". (A sketch of this step appears at the end of this section.)

Data Preprocessing (2): Factor Metrics
Intensity, spatial, temporal, and functional metrics (attributes) are computed for each factor.

ERP Factors after PCA Decomposition

TI-max (ms)  IN-mean (ROI) (µV)  IN-mean (ROCC) (µV)  ...  SP-min (channel#)
128          4.2823              4.7245               ...  24
96           1.2223              1.3955               ...  62
164          -6.6589             -4.7608              ...  59
220          -3.635              -2.0782              ...  58
244          -0.81322            0.29263              ...  65

Number of factors: Experiment 1 data, 474 (594); Experiment 2 data, 588 (598); Experiment 3 data, 708.

Mining ERP Classes with Clustering (1)
We use EM (Expectation-Maximization) clustering. For example, for Experiment 1, group 2 data:

Pattern     Cluster 0  Cluster 1  Cluster 2  Cluster 3
P100        0          76         0          2
N100        117        1          0          54
lateN1/N2   13         14         0          104
P300        0          61         110        42

Mining ERP Classes with Clustering (2)
We use OWL to represent the ERP classes.

Mining ERP Class Taxonomy with Hierarchical Clustering
We use EM clustering in both divisive and agglomerative ways (e.g., for Experiment 3 data), and we use OWL to represent the resulting class taxonomy.

Mining Properties and Axioms with Clustering-based Classification (1)
We use decision tree learning (C4.5) to do classification, with the training data labeled by the clustering results. (A combined clustering-then-classification sketch appears at the end of this section.)
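As a rough sketch of the temporal PCA step above (my own illustration using scikit-learn, not the toolboxes named in the slides): each row is one waveform and the time samples are the variables, so the retained factors are time courses. The matrix size is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Rows: (subject, condition, channel) waveforms; columns: 375 time samples.
waveforms = rng.normal(size=(890, 375))

# Temporal PCA: time samples are the variables, so factors are time courses.
pca = PCA(n_components=15)
factor_scores = pca.fit_transform(waveforms)   # (890, 15) loadings per waveform
factor_waveforms = pca.components_             # (15, 375) factor time courses

# The slides keep 15 factors when they explain most (> 75%) of the variance;
# the remaining factors are treated as noise.
print(pca.explained_variance_ratio_.sum())
```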
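The "clustering-based classification" procedure pairs EM clustering with decision-tree learning: clusters discovered over the factor metrics become class labels, and a tree trained on those labels exposes the high-information-gain attributes and their split points. A minimal sketch under assumed data, using scikit-learn's GaussianMixture (an EM implementation) and CART in place of C4.5:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# Assumed factor-metric matrix: rows are PCA factors, columns are metrics
# such as TI-max, IN-mean(ROI), IN-mean(ROCC), SP-min.
metrics = rng.normal(size=(474, 4))
names = ["TI-max", "IN-mean(ROI)", "IN-mean(ROCC)", "SP-min"]

# Step 1: EM clustering proposes candidate ERP pattern classes.
labels = GaussianMixture(n_components=4, random_state=0).fit_predict(metrics)

# Step 2: a decision tree trained on the cluster labels yields attribute
# thresholds (candidate properties) and readable rules (candidate axioms).
tree = DecisionTreeClassifier(max_depth=3).fit(metrics, labels)
print(export_text(tree, feature_names=names))
```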
Mining Properties and Axioms with Clustering-based Classification (2)
We use OWL to represent datatype properties, which are based on the attributes with high information gain (e.g., the top 6).

Mining Properties and Axioms with Clustering-based Classification (3)
We use SWRL to represent the axioms. [Slide shows the corresponding axioms in first-order logic.]

Discovering Axioms among Properties with Association Rule Mining
We use the Apriori algorithm to find association rules among properties; the split points are determined by the classification rules. [Slide shows the rules in first-order logic.] (A sketch of this step appears at the end of this section.)

Rule Optimization
Idea: from (A → B) and (A ∧ B → C), infer (A → C).

A Partial View of the Mined ERP Data Ontology
Our first-generation ERP ontology consists of 16 classes, 57 properties, and 23 axioms.

Ontology-based Data Modeling (SSDBM'08, JIIS'10)
In general, ontologies can be treated as one kind of conceptual model. Because the data (e.g., PCA factors) can be large, we propose to store it in relational databases instead of building a knowledge base. We designed database schemas based on our ERP ontologies, which include temporal, spatial, and functional concepts.

Ontology Databases
We bridge ontology constructs and relational constructs:

Ontology         Relational database
Class            Relation
Datatype         Datatype
Axioms           Keys, constraints, views, triggers
Objects (facts)  Tuples

[Charts: load time on the Lehigh University Benchmark (1.5 million facts; 10 universities, 20 departments) and query performance (logarithmic time).]

Ontology-based Data Modeling
For example, for the important subsumption axioms (e.g., subClassOf) of the current ERP ontologies, we use SQL triggers and foreign keys to represent them. (A sketch appears at the end of this section.)

Ontology-based Data Modeling
[ER diagram of the ERP ontology database: tables as boxes and foreign key constraints as arrows. The concepts pattern, factor, and channel are the most densely connected.]

NEMO Data Mapping (PAKDD'10)
- Motivation: lack of meta-analysis across experiments, because different labs may use different metrics.
- Goal of the study: mapping alternative sets of ERP spatial and temporal metrics.
- Problem definition: alternative sets of ERP metrics.
- Challenges: semi-structured data; uninformative column headers (string similarity matching does not work); numerical values.

Cross-spatial Join
After grouping, reordering, and sequence post-processing, we process all point-sequence curves from metric set 1 and metric set 2 and calculate the Euclidean distance between sequences in their Cartesian product set (the cross-spatial join). (A sketch appears at the end of this section.)

Assumptions and Heuristics
- The two datasets contain the same or similar ERP patterns if they come from the same paradigms (e.g., the visual/audio oddball paradigm: watching or listening to uncommon or fake words among common words).
- The gold standard mapping falls along the diagonal cells.
[Example with wrong mappings highlighted: precision = 9/13.]

Experiment
Design of the experiment data:
- 2 simulated "subject groups" (samples): SG1 = sample 1, SG2 = sample 2
- 2 data decompositions: tPCA = temporal PCA decomposition, sICA = spatial ICA (Independent Component Analysis) decomposition
- 2 sets of alternative metrics: m1 = metric set 1, m2 = metric set 2

Experiment Result
Overall precision: 84.6%.

NEMO-Related Ongoing Work
- Applying our framework to other domains: microRNA, medical informatics, gene databases.
- Mapping discovery and integration across ontologies related to different modalities (e.g., EEG vs. fMRI).
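As an illustration of the association-rule step above (assumed data; mlxtend's Apriori implementation rather than the original tooling): the numeric metrics are first discretized at split points taken from the classification rules, and rules are then mined over the resulting boolean properties.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Assumed discretization: each boolean column records whether a factor's
# metric falls on one side of a split point from the decision-tree rules.
df = pd.DataFrame({
    "TI-max<=150":    [True, True, False, False, False],
    "IN-mean(ROI)>0": [True, True, False, False, True],
    "SP-min<=30":     [True, True, False, True, False],
})

frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```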
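For the subsumption axioms described above, a minimal sketch (mine, against SQLite; table and column names are assumed, not NEMO's schema) of pushing a subClassOf axiom into the database as a trigger: storing a fact in a subclass table automatically asserts the superclass fact too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ScalpRegion (id TEXT PRIMARY KEY);          -- superclass
CREATE TABLE OccipitalRegion (                           -- subclass
    id TEXT PRIMARY KEY REFERENCES ScalpRegion(id)
);
-- Subsumption axiom "OccipitalRegion subClassOf ScalpRegion" as a trigger:
-- before a subclass fact is stored, assert the superclass fact as well.
CREATE TRIGGER occipital_isa_scalp
BEFORE INSERT ON OccipitalRegion
BEGIN
    INSERT OR IGNORE INTO ScalpRegion (id) VALUES (NEW.id);
END;
""")

conn.execute("INSERT INTO OccipitalRegion (id) VALUES ('ROCC')")
print(conn.execute("SELECT id FROM ScalpRegion").fetchall())  # [('ROCC',)]
```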
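And a sketch of the cross-spatial join described above (shapes and names assumed): each metric in either set is represented as a point-sequence curve, every pair in the Cartesian product is scored by Euclidean distance, and the nearest pairs are proposed as mappings.

```python
import numpy as np

rng = np.random.default_rng(2)
# Assumed point-sequence curves: one row per metric, resampled to equal length.
set1 = rng.normal(size=(13, 50))   # metric set 1
set2 = rng.normal(size=(13, 50))   # metric set 2

# Cross-spatial join: Euclidean distance over the Cartesian product of curves.
dist = np.linalg.norm(set1[:, None, :] - set2[None, :, :], axis=2)  # (13, 13)

# Propose a mapping for each metric in set 1: its nearest curve in set 2.
mapping = dist.argmin(axis=1)
for i, j in enumerate(mapping):
    print(f"set1 metric {i} -> set2 metric {j} (distance {dist[i, j]:.3f})")
```

Under the heuristic that the gold standard falls along the diagonal, precision can then be read off as the fraction of proposed mappings with i == j.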
Joint EEG-fMRI Data Mapping
[Figure slide.]

Joint work with: Gwen Frishkoff, Jiawei Rong, Robert Frank, Paea LePendu, Haishan Liu, Allen Malony, and Don Tucker.

Thanks for your attention! Any questions?