National Centre for Text Mining
• Mission
  To provide TM tools for users, in particular, scientists and researchers
  Biomedical domain
  To coordinate activities in the TM community
・ Core Partners
  University of Manchester: NLP and DM
  Salford University: Terminology
  Liverpool University: IR and Digital Archive
・ External Partners
  San Diego SC, UC Berkeley, University of Geneva, University of Tokyo

Strategy and Roadmap for TM in Biomedicine
  Vast number of Google/Yahoo users, satisfied
  Huge demand for specialized tools for TM in bio-medical domains
  Small number of users, unsatisfied
  The current TM tools, though successful in some business applications, do not meet the requirements of users in bio-medical domains.
  More publicity and marketing
  More demand-oriented approach
  What are the requirements for TM for users in bio-medical domains?
  What technologies should be integrated in future TM for science?
  Is the nature of TM in scientific fields different from that of business applications?
From technological seeds

Science: Knowledge
  Raw Data
  Unstructured Information (Text)
  Semi-structured Information (XML+Text)
  Structured Information (Databases)
Effective management of text and knowledge is the key
  Natural Language Processing
  Intelligent Text Management System
  Ontology-based KMS

Intelligent TM systems
  Retrieval: Intelligent Information Retrieval and Question Answering
  Integration: Integration of Text with Data and Knowledge
  Discovery: Text Mining and Knowledge Discovery

From Text to Knowledge
  Non-Trivial Mappings between the Language Domain and the Knowledge Domain
  Knowledge Domain: Ontology
    Relationships among concepts: Metabolic Pathways, Signal Pathways, Association between Diseases and Genes, ……
    Motivated independently of language
  Language Domain: Terminology, NLP, Paraphrasing

Examples of Technical Seeds
• Term Variants
  – Terms (names of proteins, genes, diseases, symptoms, etc.) denote basic conceptual units in the knowledge domain.
• Syntactic Variants
  – Relationships and complex conceptual units are mapped to sentences.
• Term Acquisition from Text
  – New terms (basic conceptual units) are constantly introduced. Resource building for specialized domains is crucial.

Term variants of NF-kappa B (diagram)
  Hypernym: factor
  Acronym, with spelling variation: NF-kappa B, NF kappa B, NFKB, NF-KB, NF kB
  Expanded form: nuclear factor-kappa B, nuclear-factor kappa B, nuclear factor kappa B, nuclear factor κB, Nuclear Factor kappa B
  Synonym: ………..
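The spelling variation in the NF-kappa B example can be illustrated with a small normalizer. This is only a minimal sketch, not the method used to generate the variants in these slides: the function names `normalize_term` and `group_variants` and the Greek-letter table are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed mapping for illustration; extend as needed for other Greek letters.
GREEK = {"κ": "kappa", "α": "alpha", "β": "beta"}


def normalize_term(term: str) -> str:
    """Collapse spelling variation: case, hyphens, spacing, Greek letters."""
    text = term
    for letter, name in GREEK.items():
        text = text.replace(letter, f" {name} ")
    text = text.lower()
    # Treat hyphens and any other punctuation as word separators.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    # Drop the separators so "NF-kappa B" and "NF kappa B" share one key.
    return "".join(text.split())


def group_variants(terms):
    """Group surface forms that share the same normalized key."""
    groups = defaultdict(list)
    for term in terms:
        groups[normalize_term(term)].append(term)
    return dict(groups)


VARIANTS = [
    "NF-kappa B", "NF kappa B", "NFKB", "NF-KB", "NF kB",
    "nuclear factor-kappa B", "nuclear-factor kappa B",
    "nuclear factor kappa B", "nuclear factor κB", "Nuclear Factor kappa B",
]

if __name__ == "__main__":
    for key, forms in group_variants(VARIANTS).items():
        print(key, forms)
```

Under these assumptions the ten surface forms fall into three groups. Pure spelling normalization does not connect an acronym to its expanded form or to a synonym, so those links need extra knowledge such as a terminology resource.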
Automatically Generated Term Variants (1)

  Score   Generated variant                          Count
  1.000   NF kappa B                                   128
  0.500   Transcription Factor NF kappa B                0
  0.429   NF-kappa B                                   912
  0.286   NF kB                                          0
  0.286   Immunoglobulin Enhancer-Binding Protein        0
  0.286   Immunoglobulin Enhancer Binding Protein        0
  0.286   Transcription Factor NF-kB                     0
  0.286   Transcription Factor NF kB                     0
  0.286   Factor NF-kB, Transcription                    0
  0.286   nuclear factor kappa beta                      2
  0.286   NF kappaB                                      1
  0.273   NF kappa B chain                               0
  0.273   NF kappa B subunit                             0
  0.214   Transcription Factor NF-kappa B                0
  0.214   NF-kB, Transcription Factor                    0
  0.214   NF-kB                                         67
  0.200   Neurofibromatosis Type kappa B                 0

Automatically Generated Term Variants (2)

  Score   Generated variant                Count
  1.000   tumor necrosis factor A              0
  0.316   TNF A                                1
  0.200   tumor necrosis factor             1653
  0.158   TNF alpha                          358
  0.133   TNFA                                32
  0.133   TNF                               2631
  0.133   Tumour necrosis factor alpha        14
  0.133   Tumor Necrosis Factor alpha          2
  0.133   Tumor Necrosis Factor-Alpha          0
  0.133   TUMOR NECROSIS FACTOR.ALPHA          0
  0.133   Tumor necrosis factor alpha         52
  0.133   Tumor Necrosis Factor-alpha          8
  0.133   TNF-Alpha                            0
  0.133   TNF-alpha                         6899

Examples of Technical Seeds
• Term Variants
  – Terms (names of proteins, genes, diseases, symptoms, etc.) denote basic conceptual units in the knowledge domain.
• Syntactic Variants
  – Relationships and complex conceptual units in the knowledge domain are mapped to sentences in the language domain.
• Term Acquisition from Text
  – New terms (basic conceptual units) are constantly introduced. Resource building for specialized domains is crucial.

Syntactic Variants
[A] protein activates [B] (Pathway extraction)
  Full-strength Staufen protein lacking this insertion is able to associate with oskar mRNA and activate its translation, but fails to …..
  Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
  Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Non-trivial Mapping (Language Domain to Knowledge Domain)
  Spelling Variants, Synonyms, Acronyms
  Same relations with different Structures
  Motivated independently of Language

Predicate-argument structure
Parser based on Probabilistic HPSG (Enju)
[Figure: parse tree of "The protein is activated by it" (DT NN VBZ VBN IN PRP), with constituents s, np, vp, pp, dt and predicate-argument links arg1, arg2, mod]

Text Archive with Feature Objects
Managing texts, data representation and their semantics
[Figure: a Data Base Module links the Text DB (by Text ID) to a DB of Feature Objects; feature objects are combined by Copy and Unification]
  Data representation
    extent: textid wsj 02, startp 10 (start position of the region), endp 30 (end position of the region)
    dc:creator: ninomi (annotator)
  Semantics
    content: Event, Pred bind, agent Ubiquitin (text: "Ubiquitin E is bound with")
    content: nuclear development issue
  Content specialization by unification
    event type: protein interaction
  Fine-grained units of information
  Context dependency
  Persistent nature of knowledge and information

Demo
(The website demo is not available now.)
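The "content specialization by unification" idea on the Text Archive slide can be sketched with nested dictionaries standing in for feature objects. This is a hedged illustration, not the archive's actual implementation: the `unify` function, the `UnificationError` class and the `"type"` key are hypothetical names, while the attribute values (wsj 02, 10, 30, ninomi, bind, ubiquitin) are taken from the slide.

```python
class UnificationError(ValueError):
    """Raised when two feature structures carry conflicting values."""


def unify(left, right):
    """Return the unification of two feature structures (nested dicts).

    Atomic values unify only if they are equal; dicts unify key by key.
    Neither input is modified.
    """
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)  # copy so the caller's structure stays intact
        for key, value in right.items():
            result[key] = unify(result[key], value) if key in result else value
        return result
    if left == right:
        return left
    raise UnificationError(f"conflict: {left!r} vs {right!r}")


# A general event description (the "type" key is an assumed name).
event_template = {"content": {"type": "Event", "pred": "bind"}}

# An annotation anchored to a text region, using the values on the slide.
annotation = {
    "extent": {"textid": "wsj 02", "startp": 10, "endp": 30},
    "dc:creator": "ninomi",
    "content": {"agent": "ubiquitin"},
}

if __name__ == "__main__":
    # Unification specializes the template with the annotation's details.
    print(unify(event_template, annotation))
```

Unifying the template with the annotation yields one object that carries both the event description and the text region. Two objects whose `pred` values differ would raise `UnificationError` instead of being merged.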