Knowledge Discovery for Cancer Informatics and Public Health Informatics: Techniques, Case Studies, and Lessons Learned
Hsinchun Chen
Director, Artificial Intelligence Lab, University of Arizona
Acknowledgements: NIH, NSF, NCI, ACC, NTU

Artificial Intelligence Lab Research
• UA MIS Department (4th ranked); 30+ research scientists; $25M funding since 1990; Chen, IEEE and AAAS Fellow
• Intelligence & Security Informatics research; NSF, DOJ, CIA; COPLINK system deployed in 1600+ agencies; Dark Web for countering terrorism
• Biomedical Informatics research; NLM, NCI; Chen NLM Scientific Counselor; HelpfulMed, GeneScene system and BioPortal
Artificial Intelligence Lab, MIS, University of Arizona

A Little Promotion

GeneScene: Cancer Pathway Knowledge Extraction And Visualization

GeneScene Team
Areas:
• Text Mining & Knowledge Integration
• Data Mining
• Visualization & System Development
• Domain Experts
• User Studies & Evaluation
Members:
• Dr. Gondy Leroy (Claremont)
• Byron Marshall (Oregon SU)
• Dan McDonald (Utah SU)
• Zan Huang (Penn SU)
• Jiexun Li (Drexel U)
• Chun-Ju Tseng
• Shauna Eggers
• Dr. Jesse Martinez
• Dr. George Watts
• Dr. Bernie Futscher (AZCC)
• Dr. Hua Su
• Dr. Karin Quinones

Outline
• GeneScene overview
• Research directions
  • Text mining
  • Knowledge integration
  • Data mining
• GeneScene Visualizer

Knowledge Explosion: PubMed
[Charts: number of new publications per year and accumulated new publications, 1980 to 2003]
• Average number of new citations appearing in PubMed: 746/day in 1980; 1,640/day in 2004

GeneScene Overview
• Motivation
  • Relieving information overload in biomedical research
  • Automating processes of knowledge extraction and data analysis
• Focus: genetic regulatory pathways
  • Dissection of regulatory networks is crucial for a thorough understanding of biological processes
  • Complexity of biological networks raises challenges for computational research
• Research goals
  • To develop novel Natural Language Processing (NLP) techniques to support information extraction
  • To develop machine learning and data mining techniques to support high-throughput data analysis
  • To create an integrated framework for pathway-related knowledge representation and visualization
  • Ultimately, to provide biomedical researchers with a pathway-related knowledge discovery and integration platform
• Funding: NIH/NLM, 1 R33 LM07299-01 (May 2002 – April 2006)

GeneScene: Components & Scope
• Text Mining of Biomedical Literature: to automatically extract regulatory relations between biological entities from free text
• Ontology-enhanced Knowledge Integration: to aggregate and consolidate pathway relations extracted from literature and to integrate them with existing knowledge sources using biomedical ontologies
• Data Mining for Genomic Studies: to extract regulatory pathway information based on genomic data & other resources
• Visualization of Regulatory Pathways: to facilitate accessing, understanding, and analysis of extracted pathway knowledge

Text Mining
• Extract all pathway-relevant relations from text
• Relations with gene or protein names on either end of the relation are extracted
• Two types of relations in GeneScene
  • Co-occurrence Relations (Concept Space): relations between terms that often co-occur in a set of abstracts; HelpfulMed (Cancer Space)
  • Linguistic Relations: precise & semantically rich relations from each abstract

HelpfulMED Search of Medical Websites
[Screenshot]

HelpfulMED Search of Evidence-based Databases
[Screenshot callouts: What does database cover? Search which databases? How many documents? Enter search term]

Consulting HelpfulMED Cancer Space (Thesaurus)
[Screenshot callouts: Enter search term; Select relevant search terms; New terms are posted; Search again... Or find relevant webpages]

Browsing HelpfulMED Cancer Map
[Screenshot callouts, in order: Visual Site Browser; Top level map; Diagnosis, Differential; Brain Neoplasms; Brain Tumors]

Chinese Medical Intelligence (CMI)
• Goal: Providing medical and health information services to both researchers and the public.
• Content: 350,000 high-quality medical-related webpages collected from mainland China, Hong Kong and Taiwan. Meta-search of 3 large general Chinese search engines.
• Key Features:
  • Built-in Simplified/Traditional Chinese encoding conversion
  • Dynamic summarization for both Simplified and Traditional Chinese
  • Automatic categorization
  • Visualization using SOM
[Screenshot callouts: Simplified Chinese summary; Traditional Chinese summary; Chinese folder display; Chinese visualization with SOM; Results are from both Simplified and Traditional Chinese; Select websites from mainland China, Hong Kong and Taiwan; Select search engines from mainland China, Hong Kong and Taiwan; Original encoding of the result; Simplified/Traditional Chinese summarization; Traditional Chinese results have been converted into Simplified Chinese]

GeneScene Full Parser: Arizona Relation Parser (ARP)
• Syntax and semantics are combined in a hybrid parsing grammar, as opposed to the pipelined approach
• Introduces over 150 new word classes, while retaining many of the original syntax word classes (e.g., noun, verb)
• The new word classes have semantic and lexical properties
• Semantic and syntactic properties of the new tags are not explicitly detailed in the dictionary, but rather determined by the parsing rules that define them
• Rules that apply to the tags reveal the syntactic and semantic roles of the tags

ARP: Architecture
[Architecture diagram for the parser, consisting of three main stages: tagging, parsing, and relation extraction]
• Tagging: Sentence Splitter; AZ Phrase Tagger; AZ POS Tagger
• Hybrid Parsing: Pre-processing; Parser (FSA); Correct Parsing Errors; Combine tags in tag chart using Grammar
• Relation Extraction: Identify Knowledge Patterns and Separate Conjunctions; Apply Semantic Constraints by Identifying Gene, Hormone, and Protein Names; Relations in Flat Files
• Resources shown: Dictionary / Lexicon; Contextual & Lexical Rules; Transition Rules; Correction Rules; Grammar Rules; Conjunction Rules; Knowledge Patterns; GO & HUGO

Problem: Gene Pathway
• Title: Key roles for E2F1 in signaling p53-dependent apoptosis and in cell division within developing tumors.
• Abstract: Apoptosis induced by the p53 tumor suppressor can attenuate cancer growth in preclinical animal models. Inactivation of the pRb proteins in mouse brain epithelium by the T121 oncogene induces aberrant proliferation and p53-dependent apoptosis. p53 inactivation causes aggressive tumor growth due to an 85% reduction in apoptosis. Here, we show that E2F1 signals p53-dependent apoptosis since E2F1 deficiency causes an 80% apoptosis reduction. E2F1 acts upstream of p53 since transcriptional activation of p53 target genes is also impaired. Yet, E2F1 deficiency does not accelerate tumor growth. Unlike normal cells, tumor cell proliferation is impaired without E2F1, counterbalancing the effect of apoptosis reduction. These studies may explain the apparent paradox that E2F1 can act as both an oncogene and a tumor suppressor in experimental systems.

Action | Protocols | Graphic Representation
• reads: "E2F1 signals p53-dependent apoptosis" [graph nodes: E2F1, p53, apoptosis]
• infers: "So, I'm assuming... a straight line pathway..."
[Graph for the inferred pathway: E2F1, apoptosis; expert errs and corrects]
• reads: "E2F1 acts upstream of p53" [graph nodes: E2F1, p53, apoptosis]
• reads: "E2F1 deficiency does not accelerate tumor growth" [final graph nodes: E2F1, p53, apoptosis, tumor growth]

Prepositions: OF/BY/IN
[Figure: finite-state automaton with states q0 to q18 and transitions labeled OF, BY, IN, NP, verb, Aux, mod, Negation, Nominalization (-ion), and Adjective/noun/verb (-ed)]

Example Map (one abstract)
[Figure]

Arizona Relation Parser Output
Original Sentence | Entity 1 | Negation | Connector | Entity 2
(1) wild-type p53 tumor suppressor protein, which induces […] apoptosis… | wild-type p53 tumor suppressor protein | False | induces | apoptosis
(2) Wt p53 also induced significant apoptosis | Wt p53 | False | also induced | significant apoptosis
(3) oncogene mutant p53 suppresses apoptosis | oncogene mutant p53 | False | suppresses | apoptosis
(4) mutant p53 blocked E1A-induced apoptosis | Mutant p53 | False | blocked | E1A-induced apoptosis
(4, continued) | E1A | False | induced | apoptosis
(5) mutant p53 […] does not induce […] apoptosis | mutant p53 | True | does not induce | apoptosis

Text Mining Statistics (Jan. 2005)
Collection | P53 | AP1 | Yeast
Number of Abstracts | 205,820 | 400,487 | 68,025
Number of Abstracts w/ Relation Extracted | 87,903 | 90,773 | 28,971
Linguistic relations (full parser) | 182,499 | 172,116 | 54,805
Concept Space Relations | 2,724,099 | 3,265,524 | 6,535,737

Knowledge Integration: Organizing Relations
• Relations are more useful when well organized
  • Multiple name strings of the same biological entities or processes are aggregated
  • Important contextual information is captured
  • Entities are cross-referenced to outside resources
• Well-organized relations should help with domain-appropriate analysis tasks

An Example: Context and Term Variation
• 4 somewhat contradictory PubMed abstract snippets*:
  (1) wild-type p53 tumor suppressor protein, which induces […] apoptosis…
  (2) Wt p53 also induced significant apoptosis
  (3) oncogene mutant p53 suppresses apoptosis
  (4) mutant p53 blocked E1A-induced apoptosis
• (1) and (2) say that “p53 induces apoptosis”
• (3) and (4) say that “p53 inhibits apoptosis”
* From PubMed documents 10594026, 8643473, 11809683, and 11375269

An Example: Context and Term Variation
• Analyzing the context more closely:
  • Wild-type (1) & wt (2) p53 are non-mutated
  • Mutant (3) & (4) p53 are mutated
  • P53 protein (1) is a protein
  • Oncogene p53 (3) is a gene
• Identifying context is important in organizing extracted information. Words near “p53” suggest that while normal p53 induces apoptosis, mutated p53 inhibits it

Biological Entity Recognition and Identification
• To aggregate relations we need to recognize and identify biological entities
• Recognition finds substance references in text; identification matches those references to external resources (Tuason et al. 2004)
• Three key object recognition difficulties (Fukuda et al., Palakal et al.):
  • Compound words
  • Ambiguous expressions
  • New or unknown words

Aggregation System Design
[Diagram: PubMed abstracts are parsed by the Arizona Relation Parser (ARP) into relational triples; lexicon curation from HUGO, RefSeq, SGD, LocusLink, and GO yields the aggregatable substance lexicon and other feature lexicons; the BioAggregate Tagger applies decompositional tagging to the triples to produce aggregated relations for the network visualizer]

A Decompositional Approach to Biomedical Concept Matching
• BioAggregate tagger decomposes name strings found in a relation by left-to-right, longest-first pattern matching using domain-appropriate lexicons of feature-signaling terms
• Lexicons built from existing resources and human-generated lists
  • Substance names in LocusLink, RefSeq, HUGO, SGD, etc.
  • Biological processes in Gene Ontology
  • Lexicons of other features

Features For Decompositional Matching
Feature Lexicon: Explanation
• Aggregatable Substance: A gene and its product(s). All references to a particular gene and its product(s) share the same Aggregatable Substance value. E.g., p53, tp53, and trp53.
• Mutation: Indicating the status of an aggregatable substance. Only has two values, mutated or not mutated (wild-type).
• Substance Type: "Type" of aggregatable substance. Currently there are three recognized types: gene, protein, and mRNA.
• Associator: Essentially verbs. This feature attempts to resolve verbs that occur in multiple forms, but have the same stem. E.g., inhibit, inhibits, inhibited, and inhibiting all share the Connector Associator value "inhibit."
• Function: A biological process, such as apoptosis or angiogenesis (as in biological_process list of Gene Ontology), or an action performed on an aggregatable substance, such as phosphorylation or inhibition.
• Species: The species/organism information associated with an entity or relation.
• Cellular Component: The sub-cellular component or location of an entity or relation.
• Stopword: Common words judged to meet the standard “ignoring this word will not mischaracterize pathway relations.”

P53 Testbed
• ARP extracted 182,499 relations from 87,903 PubMed abstracts related to the gene p53
• As extracted, the relations display very little overlap, with 142,974 distinct entity names and 127,397 distinct relational pairs

5 Levels of Aggregation Granularity
Aggregation Level | Possible Applications (ordered from more detailed to more abstract)
Baseline (string match entities and connector) | basis of comparison
Feature Match (feature synonym resolution) | detailed pathway analysis
Typed Substance (distinguish genes and proteins) | pathway analysis – granularity is comparable to some human-curated databases
Aggregatable Substance | explore the function of a gene and its gene products
Simple Pathway (substance/function, 4 categories for connectors) | high-level overviews and input to some machine learning algorithms

Network Consolidation Results
[Chart: Distinct Items Identified in the 182,660 P53 Relations at Different Levels of Aggregation; y-axis: Number of Distinct Items (0 to 60,000); series: Entities, Relations, Disjoint Relations; x-axis: Aggregation Level (Baseline, Feature Match, Typed Substance, Typed No Residual, Aggregatable Substance, Simple Pathway)]
• The numbers of distinct entities and relations are sharply reduced over the various levels of aggregation
• When fewer relations are disjoint, the knowledge network encompasses more information
• Network density increased 20-fold

Genomic Data Mining in AI Lab
• Joint learning of genetic networks from microarray data & existing knowledge
• Gene selection for cancer diagnosis

Gene Selection for Cancer Diagnosis
• Gene array data have been widely used for cancer classification/prediction (Golub et al., 1999; Ben-Dor et al., 2001)
• The major problems of gene array data (Model et al., 2001; Lu & Han, 2003)
  • High dimensionality (hundreds or thousands of genes)
  • Small number of available samples
  • Most genes are irrelevant to cancer distinction
  • Genes interact with each other
• It is important to identify the marker genes for cancer diagnosis

Experiment: Ovarian Cancer Diagnosis
• Ovarian cancer
  • 25,580 projected cases in 2004
  • 16,090 deaths estimated in 2004
  • 53% overall 5-year survival
  • 31% 5-year survival in those with distant metastases at diagnosis
  • 75% of cases diagnosed in late stages (III & IV)
• Predict survival of ovarian cancer: alive or dead?
  • Clinical measurements: two attributes, stage & grade
  • Gene methylation level: differentially methylated between normal and cancer tissues

Ovarian Cancer Methylation Array
• University of Iowa Gynecologic Oncology tumor bank (provided by Dr. Bernie Futscher at AZCC)
• 114 DNA samples
  • 11 Normal ovary
  • 19 Stage I
  • 18 Stage II
  • 24 Stage III
  • 17 Stage IV
  • 25 Low malignant potential (LMP)
• 6560 genes
  • The top 1000 genes with the highest standard deviation across all samples are regarded as potentially relevant

Gene Selection Techniques
• Individual gene ranking
  • F-statistics (Mendenhall & Sincich, 1995)
• Gene subset selection
  • Optimal search algorithms
    • Genetic algorithm (GA) (Holland, 1975)
    • Tabu search (TS) (Glover, 1986)
  • Evaluation criteria
    • Maximum relevance & minimum redundancy (MRMR) (Ding & Peng, 2003)
    • Support vector machine (SVM) (Vapnik, 1995)

Marker Genes for Survival Prediction
• Q1: Which genes can be used to predict the survival of ovarian cancer based on their methylation level?
Level    | #F   | N  | Mean   | StDev
Full set | 1000 | 30 | 67.690 | 2.100
F-stat   | 100  | 30 | 77.398 | 1.414
GA/MRMR  | 57   | 30 | 76.199 | 2.018
GA/SVM   | 39   | 30 | 80.263 | 1.592
TS/MRMR  | 24   | 30 | 80.702 | 1.613
TS/SVM   | 96   | 30 | 82.807 | 2.205
Pooled StDev = 1.847
[Plot: individual 95% CIs for the mean, based on pooled StDev; axis ticks 70.0, 75.0, 80.0, 85.0]
• Conclusion: TS/SVM selected 96 out of 1000 genes, which achieved the highest prediction accuracy (82.807%)

Methylation vs. Clinical Diagnosis
• Q2: Will methylation-based methods perform better than clinical diagnosis in predicting survival of ovarian cancer?
Level    | #F   | N  | Mean   | StDev
Clinical | 2    | 30 | 75.281 | 1.770
Full set | 1000 | 30 | 57.566 | 2.747
F-stat   | 70   | 30 | 75.581 | 1.413
GA/MRMR  | 40   | 30 | 75.506 | 1.756
GA/SVM   | 46   | 30 | 79.813 | 0.205
TS/MRMR  | 24   | 30 | 77.715 | 1.868
TS/SVM   | 30   | 30 | 81.948 | 1.769
Pooled StDev = 1.790
[Plot: individual 95% CIs for the mean, based on pooled StDev; axis ticks 63.0, 70.0, 77.0]
• Conclusions: prediction accuracy of Full set < Clinical < Marker genes (selected by TS/SVM)

GeneScene Visualizer
• To provide graphical presentation of large-scale regulatory networks composed of pathway relations extracted through text mining technologies
• Three testing collections
  • p53 (87,903 abstracts)
  • AP1 (90,773 abstracts)
  • Yeast (28,971 abstracts)
• Currently loading and parsing the entire PubMed for cancer pathways (~7 million abstracts)

GeneScene Visualizer: Functionality
• Searching: by specific elements, e.g., diseases or genes
• Network-based exploration and navigation
• Accessing the underlying PubMed abstract
• Saving and loading search and visualization results
• Various manipulations on the table and network view of the retrieved relations: filter, sort, zoom, highlight, isolate, expand, print, etc.

GeneScene Visualizer V1.5
[Two screenshots]

Effect of Aggregation
• Same relations, before and after aggregation
[Screenshots: Baseline level; Simple Pathway level]

Effect of Aggregation: Mutation Feature
When mutant and non-mutant are combined, an apparent conflict arises: TP53 is both activating and inhibiting MDM2.
When the Mutation feature is selected, non-mutant TP53 is shown to activate and mutant TP53 to inhibit MDM2.

User Feedback
General Comments
• Interviewees were generally impressed with the features and usefulness of the system
  • “In my head I've been trying to do what this is doing for you.”
  • “It took me a few weeks just to find that Sin3 interacts with p53, where when you type this in [to Genescene] it's right there.”
  • “Just playing around [with the system] I am seeing things that I didn't know before.”
  • “If this is the entire Medline, I would probably use it every time I search.”
  • “I think a lot of people would get a lot of use out of this, as long as it doesn't scare them off in the beginning.”

Lessons Learned
• Biomedical information is precise, but terminologies are fluid
• Biomedical professionals need search and analysis help
• Biomedical linguistic parsing and ontologies are promising for biomedical text mining
• Biomedical data (gene microarray) and text mining (literature) need to be integrated

Ongoing Research
• Combining bottom-up data mining (MicroArray/Methylation) with top-down text mining results
• Creating CancerPath testbed for cancer genomic network visualization
• Topological analysis of biological networks (growth, preferential attachment)
• Other biomedical applications: plant science pathway (Arabidopsis; Galbraith Lab); infectious disease surveillance

BioPortal: Infectious Disease Information Sharing, Surveillance, Analysis, and Visualization

Research Partners and Supports
Partners:
• University of Arizona
• University of California, Davis
• Kansas State University
• University of Utah
• Arizona Department of Public Health
• New York State Department of Health/HRI
• California Department of Health Services/PHFE
• U.S. Geological Survey
• The SIMI Group
• National Taiwan University
Supports:
• NSF
• CIA/ITIC
• DHS
• DOD/AFMIC
• CDC
• AZDPS

UA Team Members
• Dr. Hsinchun Chen
• Dr. Daniel Zeng
• Lu Tseng
• Cathy Larson
• Kira Joslin
• Wei Chang
• James Ma
• Hsinmin Lu
• Ping Yan
• Aaron Sun
• Keith Alcock
• Sapna Brahmanandam
• Milind Chabbi
• Yuan Wang

Outline
• Project Background
• BioPortal Achievements
  • System Architecture
  • System Functionalities
  • BioPortal Collaboration Framework
• New Developments
  • International Foot-and-mouth Disease Monitoring
  • Syndromic Surveillance
  • Disease Contact Tracing

BioPortal Project Goals
• Demonstrate and assess the technical feasibility and scalability of an infectious disease information sharing (across species and jurisdictions), alerting, and analysis framework.
• Develop and assess advanced data mining and visualization techniques for infectious disease data analysis and predictive modeling.
• Identify important technical and policy-related challenges in developing a national infectious disease information infrastructure.

Information Sharing Infrastructure Design
[Diagram components: NYSDOH (PHINMS network), CADHS (XML/HL7 network), and new sources; adaptors with SSL/RSA; info-sharing infrastructure with data ingest control module, cleansing / normalization, and portal data store (MS SQL 2000)]

Data Access Infrastructure Design
[Diagram components: public health professionals, researchers, policy makers, law enforcement agencies & other users; WNV-BOT Portal; browser (IE/Mozilla/…) over SSL connection; web server (Tomcat 4.21 / Struts 1.2) with data search and query, spatial-temporal visualization, analysis / prediction, dataset privileges management, and HAN or personal alert management; user access control API (Java); data store (MS SQL 2000); access privilege definitions]
Datasets Integrated: WNV, BOT
Index | Dataset | Test Data Available | Data Duration (MM/YY) | Data Size | Spatial Granularity | Temporal Granularity
1 | [NY] WNV Human | Yes | Test Data | 574 | Zip | Date
2 | [NY] Dead Bird | Yes | Test Data | 942 | Lat/Long shifted |
3 | [NY] Mosquito | Yes | Test Data | 815 | County | Date
4 | [NY] WNV Captive Animal | Yes | Test Data | 39 | Zip | Date
5 | [NY] Botulism Human | Yes | Test Data | 10 | Zip | Date
6 | [CA] WNV Human | Yes | 09/03–10/03 | 186 | County | Date
7 | [CA] Dead Bird | Yes | 01/03–10/03 | 3032 | City/zip | Minutes
8 | [CA] Chicken Sera | Yes | 04/03–10/03 | 18887 | Site | Date
9 | [CA] Mosquito Pool | Yes | 01/98–10/03 | 3518 | Site | Date
10 | [CA] Botulism | Yes | 01/01–12/02 | 53 | Zip | Date
11 | [USGS] EPIZOO - Preliminary | Yes | 07/99–09/03 | 46 events | County | Date
12 | [USGS] EPIZOO - WNV | Yes | 08/1999–07/2004 | 113 events | County | Date
13 | [USGS] EPIZOO - Botulism | Yes | 12/1989–12/2004 | 702 events | County | Date
14 | [UC Davis] FMD | Yes | 1996–2003 | 3288 | Site/Province | Date/Month
15 | International FMD | Yes | 01/1982–03/2005 | 6789 | Province | Non-temporal
16 | BioWatch | Yes | 1/10–1/17 2004 | 480 | Exact Site | Date
17 | [CA] Mosquito Treatment | Yes | 1/14–11/30 2004 | 6194 | Exact Site | Date
18 | National Infant Botulism | Yes | 1/16–11/25 2004 | 15 | Zip | Date

Communications/Messaging
• Scalable, flexible, light-weight, and extensible. Easy to include:
  • New diseases
  • New jurisdictions
  • New techniques!
• Messaging infrastructure – installed and tested
  • NYSDOH-UA: PHIN MS
  • CADHS-UA: Regional message broker
  • NWHC-UA: PHIN MS
• XML generation/conversion
  • NY_DeadBird, NY_Alerts, NY_BotHuman, NY_WNVHuman, NY_CaptiveAnimal, NY_Mosquito
  • CA_BotHuman, CA_WNVHuman, CA_DeadBird, CA_Chicken, CA_Mosquito
  • USGS_Epizoo

Spatio-Temporal Data Mining & Hotspot Analysis
• A hotspot is a condition indicating some form of clustering in a spatial and temporal distribution (Rogerson & Sun 2001; Theophilides et al. 2003; Patil & Tailie 2004).
• For WNV, localized clusters of dead birds typically identify high-risk disease areas (Gotham et al. 2001).
• Automatic detection of dead bird clusters using hotspot analysis can help predict disease outbreaks and aid in effective allocation of prevention/control resources.

Case Study (NY WNV)
[Timeline figure: March 5 to May 26 (baseline) and May 26 to July 2 (new cases); record counts shown: 140 records and 224 records]
• On May 26, 2002, the first dead bird with WNV was found in NY
• Based on NY’s test dataset

Dead Bird Hotspots Identified
[Map sequence captions: NY dead bird cases in 2002; baseline cases + new cases in zoomed-in area; hotspot analysis results in zoomed-in area; close-up of the hotspots; zoom in; high-density area; SaTScan #1; SaTScan #2; RNNH]
• Analysis results from SaTScan and RNNH
  • SaTScan picks a large cluster: 71 new, 7 baseline
  • RNNH picks a small cluster: 53 new, 6 baseline

WNV/BOT BioPortal
Acknowledgment: NSF, ITIC, NYSDH, CDHS, USGS (Drs. Kvach and Ascher)
[Screenshot walkthrough callouts: User login; User main page; Available dataset list; Dataset name; Choose WNV disease data; Select CA dead bird, chicken and NY dead bird data; Advanced spatial / temporal search criteria: positive cases, time range, county / state, specify bird species; Results listed in table; Select background maps: NY / CA population, rivers and lakes; Start STV]
[STV screenshot callouts: GIS; Timeline; Periodic pattern; Control panel; Close; Zoom in NY; Year 2001 data; Move time slider; 2 weeks window; 1 year window; 3 year window; View all data in 3 year span; NY dead bird temporal distribution pattern: concentrated in May / Jun, similar time pattern, overall pattern; Spatial distribution pattern; Overlay population map; Enable population map; Dead bird cases migrate from Long Island into upstate NY; Dead bird cases distribute along populated areas near the Hudson River; Season end]

BioPortal HotSpot Analysis: RSVC, SaTScan, and CrimeStat Integrated (first visual, real-time hotspot analysis system for disease surveillance)
• West Nile virus in California

Hotspot Analysis-Enabled STV
[Screenshot callouts: Regular STV; Select algorithms; Select baseline and case periods; Select target geographic area; Hotspots found!; Select hotspot to highlight case points]

International FMD BioPortal
Acknowledgment: DHS, DOD, UC Davis (Drs. Thurmond and Lynch)

International FMD BioPortal Goals
• Real-time, web-based situational awareness of FMD outbreaks worldwide through the establishment of an international information sharing and analysis system
• FMDv characterization at the genomic level integrated with associated epidemiological information and modeling tools to forecast national, regional, and/or international spread and the prospect of import into the U.S. and the rest of North America
• Web-based crisis management of resources: facilities, personnel, diagnostics, and therapeutics

Research Plans
• Global FMD epidemiological data
  • (Near) real-time data collection
  • Web-based information sharing and analysis
• International FMD news
  • Indexed collection of global FMD news
  • Search and visualization of the FMD news via the web
• FMD genetic/sequence data
  • Predictive model using phylogenetic, spatial, and temporal information to stop FMD at the border
  • Visualization of FMD events in time, space, and genetic space

Preliminary Global FMD Dataset
• Provider: UC Davis FMD Lab
• Information sources: reference labs and OIE
• Coverage: 28 countries globally
• Time span: May 1905 – March 2005
• Dataset size: 30,000+ records, of which 6789 records are complete
• Host species: Cattle, Caprine, Ovine, Bovine, Swine, NK, Elephant, Buffalo, Sheep, Camelidae, Goat

Regionwise Distribution of FMD Data
[Pie chart, by region: South America 66%; Central and South Asia 15%; Europe 14%; Middle East Asia 4%; Africa 1%]
[Pie chart, by host species: Bovine 37%; Ovine 37%; Swine 11%; Cattle 5%; Caprine 4%; Buffaloes 3%; Goats 3%; Elephant 0%; Sheep 0%; Camelidae 0%]

Global FMD Coverage in BioPortal
[Screenshot]

FMD Migration Visualization using BioPortal (cases in South Asia)
[Screenshot callout: FMD cases travel back and forth between countries]

International FMD News
• Provider: UC Davis FMD Lab
• Information sources: Google, Yahoo, and open Internet sources
• Time
span: Oct 4, 2004 – present (real-time messaging under development) • Data size: 460 events (6/21/05) • Coverage: 51 countries UNDEFINED 5% (Africa:11, Asia:16, Europe:12, Americas:12) America 27% Africa 11% Aisa 1% Asia 15% Australia 14% Europe 27% Searching FMD News http://fmd.ucdavis.edu/ Searchable by Date range Country Keyword Visualizing FMD News on BioPortal FMD Genetic Information Analysis • Genome clustering analysis Phylogenetic clustering Spatial clustering Temporal clustering • Hotspot detection among gene sequences Create a tree structure based on semantic distance between gene sequences. Automatically detect the dense portion of the tree. Identify the connection between the semantic cluster and the geographic pattern of gene sequences. FMD Genetic Visualization • Goal: Extend STV to incorporate 3rd dimension, phylogenetic distance Include a phylogenetic tree. Identify phylogenetic groups and color-code the isolate points on the map. Leverage available NCBI tools such as BLAST. • Proof of concept: SAT 2 & 3 analysis Data: 54 partial DNA sequence records in South Africa received from UC Davis FMD Lab (Bastos,A.D. et al. 2000, 2003) Date range: 1978-1998 Countries covered: South Africa, Zimbabwe, Zambia, Namibia, Botswana Sample FMD Sequence Records Color-coded View (MEGA3) Textual View of Gene Sequence Phylogenetic Tree of Sample FMD Data Identify 6 groups within 2 major families (MEGA3; based on sequence similarity) Group6 Group1 Group5 Group2 Group4 Group3 Genetic, Spatial, and Temporal Visualization of FMD Data Phylogenetic tree color coded Isolates’ locations color coded Isolates’ appearances in time FMD Time Sequence Analysis First family cases appeared throughout the period 2nd family cases exist before 1993 and a comeback lately Second family cases existed before 1993 and reappeared later after 1997 FMD Periodic Pattern Analysis 2nd family concentrated in Feb. 
while 1st family spread evenly

Locations of Family 1 records
• Selected only groups 1, 2, and 3 and found a spatial cluster

Locations of Family 2 records
• Selected only groups 4, 5, and 6
• Sparse isolate locations

Syndromic Surveillance (2015/10/12)

Chief Complaints As a Data Source
• Chief complaints (CCs) are short free-text phrases entered by triage practitioners to describe the reasons for patients' ER visits.
  - Examples: lt foot pain [left foot pain]; cp [chest pain]; sob [shortness of breath]; so [should be "sob"]; poss uti [possible urinary tract infection]
• Advantages of using CCs for surveillance purposes
  - Timeliness: diagnosis results arrive on average 6 hours later than CCs.
  - Availability and low cost: most hospitals have free-text CCs available in electronic form.

Existing CC Classification Methods
| Classification Method | Systems | Authors |
| Keyword Match + Synonym List + Mapping Rules | DOHMH (NY City), EARS | Mikosz et al. (2004) |
| Weighted Keyword Match (Vector Cosine Method) + Mapping Rules | ESSENCE | Sniegoski (2004) |
| Naïve Bayesian | RODS | Olszewski (2003), Ivanov et al. (2002) |
| Bayesian Network | N/A | Chapman et al. (2004) |

Overall System Design
[Diagram: three-stage pipeline. Stage 1, CC Standardization, maps chief complaints to symptoms; Stage 2, Symptom Grouping, maps symptoms to symptom groups; Stage 3, Syndrome Classification, maps symptom groups to syndromes. Resources labeled in the diagram: EMT-P, synonym list, UMLS ontology, UMLS concepts, Weighted Semantic Similarity Score, symptom grouping table, JESS, EARS syndrome rules, EARS symptom table.]

A Stage 2 Example
[Diagram: CC concepts ("blood in urine", "pass out") are linked through the UMLS to symptom group concepts (coagulopathy, purpura, ecchymosis, ureteral stone, coma); the links carry the labels 4, 5, and 6.]
• bleeding = 1/4 + 1/5 + 1/6 = 0.62
• other = 1/5 = 0.2
• coma = 1/5 = 0.2
• dead = 1/5 = 0.2
• altered_mental_status = 1/5 = 0.2

Summary of Stage 2 Performance
[Two charts with a shared legend: covered by the EARS/EMT-P; additional coverage suggested by our Weighted Semantic Similarity Score approach; unidentified. Chart 1: the 3001 concepts contained in the 1835 CC records from Stage 1, with shares of 3%, 44%, and 53%. Chart 2: the 417 unique concepts, with shares of 11%, 64%, and 25%.]

BioPortal – Taiwan Syndromic Surveillance

Multi-lingual Chief Complaints: Chinese Example
• Data characteristics: mixed expressions in both Chinese and English
  - 頭痛;頭暈;FEVER;腹痛;噁心嘔吐多次;旅遊史(無) [headache; dizziness; FEVER; abdominal pain; nausea and vomiting multiple times; travel history (none)]
  - 車禍,導致左手背A/W,疼痛不適,咳嗽有痰 [car accident, causing A/W on the back of the left hand, pain and discomfort, cough with sputum]
• 18% of CC records from NTU Med. Center contain Chinese expressions.
• Some hospitals have 100% of their CC records in Chinese (for example, 馬偕紀念醫院 [Mackay Memorial Hospital]).
• Misspellings and typographic errors are not serious.

Chinese CC Preprocessing: System Design
[Diagram: Stage 0.1 separates the Chinese and English expressions in raw Chinese chief complaints; Stage 0.2, Chinese Phrase Segmentation, produces segmented Chinese phrases; Stage 0.3, Chinese Phrase Translation, produces translated Chinese phrases. Resources labeled in the diagram: Chinese medical phrases, common Chinese phrases, Chinese-to-English dictionary, mutual information.]
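
Stage 0.1 of the preprocessing design separates the Chinese and English parts of a mixed chief complaint. The following is a minimal Python sketch of that one step, under stated assumptions: the function name `separate_expressions`, the delimiter set, and the use of the CJK Unified Ideographs range (U+4E00 to U+9FFF) as the test for "Chinese" are illustrative choices, not the BioPortal implementation, which the slides do not describe.

```python
import re

# Assumption: a "Chinese" character is one in the CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")
# Assumption: these ASCII and full-width marks separate expressions.
DELIMITERS = re.compile(r"[;；,，、。]+")
HAS_ALNUM = re.compile(r"[A-Za-z0-9]")


def separate_expressions(cc):
    """Split one mixed chief complaint into (Chinese runs, English fragments)."""
    chinese, english = [], []
    for part in DELIMITERS.split(cc):
        # Every maximal run of CJK characters is kept as one Chinese expression.
        chinese.extend(CJK_RUN.findall(part))
        # Whatever remains after blanking the CJK runs is treated as English.
        rest = " ".join(CJK_RUN.sub(" ", part).split())
        # Drop leftovers that are only punctuation, such as empty brackets.
        if HAS_ALNUM.search(rest):
            english.append(rest)
    return chinese, english


if __name__ == "__main__":
    print(separate_expressions("poor intake and 味覺喪失"))
```

On a mixed record such as "poor intake and 味覺喪失", the sketch returns the Chinese run for segmentation and translation and the English fragment separately. Segmenting the Chinese runs into phrases (Stage 0.2) needs a phrase list and is not attempted here.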
Result: Self Validation
• Apply the 280 translations to 1978 chief complaints from hospital A
• 1610 (82%) records are in English
• 368 (18%) records contain Chinese
  - 36% contain trivial info, e.g. "r/o septic shock 外院轉入" [transferred from another hospital]
  - 64% contain non-trivial info, e.g. "poor intake and 味覺喪失" [loss of taste]
    - 67% have a complete translation
    - 2% have a partial translation
    - 20% have no translation

General Grouping
• Taiwan surveillance data visualization: 2.2M+ scrubbed chief complaint records

Group by Hospital

Group by Syndrome Classification

Disease Outbreak Detection Using Chief Complaints
• Markov Switching Model

Data Source
• Emergency Department free-text chief complaints (CCs)
  - Medical practitioners use both Chinese and English to record CCs
  - 368,151 records; 23.77% contain Chinese characters
  - Time period: 2000-6-30 to 2003-4-27
• Use the BioPortal Multilingual CC Classifier to classify CC records into syndromes

Syndrome Prevalence
• Botulism-Like (1.4%)
• Constitutional (25.4%)
• Gastrointestinal (26.4%)
• Hemorrhagic (12.7%)
• Neurological (14.1%)
• Other (34.9%)
• Rash (2.4%)
• Respiratory (17.8%)
  - Upper Respiratory (7.3%)
  - Lower Respiratory (6.4%)
• Fever (18%)
• Choose Resp.
and GI syndrome for further analysis
  - Two syndromes with high prevalence
  - Can be extended to other syndromes

GI Syndrome Time Series
[Figure: Gastrointestinal Syndrome Time Series (daily counts, 2001 to 2003) and its autocorrelation function (ACF against lag, 0 to 400)]

GI Syndrome Time Series (cont'd)
• Strong day-of-week effect
• Seasonal effect is less strong
• Sporadic jumps
• Seems to have 1 – 2 peaks per year

Estimation Results (GI Syndrome)
[Figure: three panels over 2001 to 2003: GI time series count, outbreak state, and jumps (post_jump_size)]

Estimation Results (GI Syndrome) (cont'd)
• Jumps appear during Chinese New Year
• The Markov switching model identified 4 high GI-count periods:
  - 2000-12-23 to 2001-1-28 (Jan. 23 New Year Eve)
  - 2002-1-29 to 2002-3-15 (Feb. 11 New Year Eve)
  - 2002-5-9 to 2002-10-14
  - 2002-12-13 to 2003-2-18 (Jan. 30 New Year Eve)

Taiwan SARS Contact Tracing

Social Network Analysis in Epidemiology
• Conceptualizing a population as a set of individuals linked together to form a large social network provides a fruitful perspective for better understanding the spread of some infectious diseases.
(Klovdahl, 1985)
• Social network analysis in epidemiology has two major activities:
  - Network Construction: link the whole set of persons in a particular population through relationships or types of contact
  - Network Analysis: measure and make inferences about structural properties of the social networks through which infectious agents spread

A Taxonomy of Network Construction
| Disease | Linking Relationship | Examples |
| Sexually Transmitted Disease (STD) | Sexual Contact | AIDS (Klovdahl, 1985); Gonorrhea (Ghani et al., 1997); Syphilis (Rothenberg et al., 1998) |
| Sexually Transmitted Disease (STD) | Drug Use | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998) |
| Sexually Transmitted Disease (STD) | Needle Sharing | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998) |
| Sexually Transmitted Disease (STD) | Social Contact | AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998) |
| Tuberculosis (TB) | Personal Contact | (Klovdahl et al., 2001); (McElroy et al., 2003) |
| Tuberculosis (TB) | Geographical Contact | (Klovdahl et al., 2001); (McElroy et al., 2003) |
| Severe Acute Respiratory Syndrome (SARS) | The Source of Infection | (CDC*, 2003); (Shen et al., 2004) |
| Severe Acute Respiratory Syndrome (SARS) | Personal Contact | (Meyers et al., 2005) |
*CDC: Centers for Disease Control and Prevention

A Taxonomy of Network Analysis
| Levels of Analysis | Description | Examples |
| Network Visualization | Show the spread of an infectious agent transmitted from one person to another | AIDS (Klovdahl, 1985); Syphilis (Rothenberg et al., 1998); SARS (CDC*, 2003); SARS (Shen et al., 2004) |
| Network Measurement | Study the structure of a population through which an infectious agent is transmitted during close personal contact; develop disease containment strategies or programs | Syphilis (Rothenberg et al., 1998); AIDS (Klovdahl et al., 1994); AIDS (Rothenberg et al., 1998) |
| Network Simulation | Evaluate the spread of an infectious agent within a population with different network parameters | Gonorrhea (Ghani et al., 1997); SARS (Meyers et al., 2005) |
*CDC: Centers for Disease Control and Prevention

Network Visualization
• Use social networks to visualize the
transmission of an infectious agent from one person to another within a particular population
• Focus on the identification of:
  - Subgroups within the population
  - Characteristics of each subgroup
  - Bridges between subgroups, which transmit a disease from one subgroup to another
[Figures: clusters in Singapore (source: CDC, 2003); syphilis transmission (Rothenberg et al., 1998)]

Epidemic Phases and Social Networks
• Potterat et al. (2001) proposed that the structure of sexual networks is a more reliable indicator of STD epidemic phase.
  - Two sexual networks in Colorado Springs, U.S. were compared:
    - Bacterial STD from 1990 to 1991 (an STD outbreak)
    - Chlamydia from 1996 to 1999 (stable or declining phase)
  - The sexual network in the stable or declining phase was relatively:
    - Fragmented
    - Dendritic
    - Lacking in cyclic structures
• Cunningham et al. (2004) further examined the relationship between network characteristics and epidemic phases. After the epidemic:
  - Macro-level structure: average distance declined; density increased.
  - Micro-level structure: the numbers of n-cliques and k-plexes declined.

Research Test Bed
• We use Taiwan SARS data as our research test bed.
• SARS (Severe Acute Respiratory Syndrome) is a novel infectious disease that emerged in 2002.
  - The first human case was identified in Guangdong Province, China on November 16, 2002. (Donnelly et al., 2004)
  - A 65-year-old doctor from Guangdong Province stayed at a hotel in Hong Kong in February 2003 and infected at least 17 other guests and visitors at the hotel, some of whom later traveled to other countries and initiated local transmission of SARS. (Peiris et al., 2006)
  - 26 countries, including Vietnam, Singapore, Canada, and Taiwan, reported SARS cases.
  - Financial impact: $50B

SARS in Taiwan
• The first SARS case in Taiwan was a Taiwanese businessman who traveled to Guangdong Province via Hong Kong in early February 2003.
  - Had onset of symptoms on February 26, 2003
  - Infected two family members and one healthcare worker
• Eighty percent of probable SARS cases were infected in a hospital setting.
  - The first outbreak began at a municipal hospital on April 23, 2003.
  - A total of seven hospital outbreaks were reported.
  - Hospital shopping and patient transfers were suspected of triggering these sequential hospital outbreaks.

Taiwan SARS Data
• The Taiwan SARS data was collected by the Graduate Institute of Epidemiology at National Taiwan University during the SARS period.
• The dataset covers 961 patients: 638 suspected SARS patients and 323 confirmed SARS patients.
• The contact-tracing data has two main categories, personal and geographical contacts, and nine types of contact.
  - Personal contacts: family member, roommate, colleague/classmate, and close contact
  - Geographical contacts: foreign-country travel, hospital visit, high risk area visit, hospital admission history, and workplace

Taiwan SARS Data (Cont.)
• Hospital admission history is the contact type with the largest number of records (43%).
• Personal contacts consist primarily of family member records.
| Category | Type of Contacts | Records | Suspected Patients | Confirmed Patients |
| Personal | Family Member | 177 | 48 | 63 |
| Personal | Roommate | 18 | 11 | 15 |
| Personal | Colleague/Classmate | 40 | 26 | 23 |
| Personal | Close Contact | 11 | 10 | 12 |
| Geographical | Foreign-Country Travel | 162 | 100 | 27 |
| Geographical | Hospital Visit | 215 | 110 | 79 |
| Geographical | High Risk Area Visit | 38 | 30 | 7 |
| Geographical | Hospital Admission History | 622 | 401 | 153 |
| Geographical | Workplace | 142 | 22 | 120 |
| Total | | 1425 | 638 | 323 |

Research Design

Phase Analysis (Cont.)
• Network Partition
  - We partition each contact network on a weekly basis with linkage accumulation.
  - From 2/24 to 5/4, there are 10 weeks in total.
[Timeline: the personal contact network partitioned into Week 1 through Week 10, with week boundaries at 2/24, 3/3, 3/10, 3/17, ..., 5/4]

Phase Analysis (Cont.)
• Network Measurement
  - We investigate two macro-structure factors that contribute to the transmission of disease:
    - Density: the degree of intensity to which people are linked together. Measures: density, average degree of nodes.
    - Transferability: the degree to which people can infect others. Measures: betweenness, number of components.
[Illustration: example networks with lower vs. higher density and lower vs. higher transferability]

Connectivity Analysis
• Geographical contacts provide much higher connectivity than personal contacts in the network construction.
  - They decrease the number of components from 961 to 82.
  - They increase the average degree from 0.31 to 108.62.
| Applied Contacts in the Network Construction | Average Degree (Patient Nodes) | Maximum Degree (Patient Nodes) | Number of Components |
| Personal Contacts | 0.31 | 4 | 847 |
| Geographical Contacts | 108.62 | 474 | 82 |
| Personal + Geographical Contacts | 108.85 | 474 | 10 |

Connectivity Analysis (Cont.)
• Hospital admission history provides the highest connectivity of nodes in the network construction.
• Hospital visits provide the second highest connectivity.
• This result is consistent with the fact that most patients were infected in the hospital outbreaks during the SARS period.
| Category | Applied Contacts in the Network Construction | Average Degree | Maximum Degree | Number of Components |
| Personal Contacts | Family Member | 0.204 | 4 | 893 |
| Personal Contacts | Roommate | 0.031 | 2 | 946 |
| Personal Contacts | Colleague/Classmate | 0.06 | 3 | 934 |
| Personal Contacts | Close Contact | 0.023 | 1 | 949 |
| Geographical Contacts | Foreign-Country Travel | 2.727 | 41 | 848 |
| Geographical Contacts | Hospital Visits | 10.077 | 105 | 753 |
| Geographical Contacts | High Risk Area Visit | 1.388 | 36 | 924 |
| Geographical Contacts | Hospital Admission History | 50.479 | 289 | 409 |
| Geographical Contacts | Workplace | 4.694 | 61 | 823 |

One-Mode Network with Only Patient Nodes
[Figure; legend: suspected, confirmed]

Contact Network with Geographical Nodes
[Figure; legend: area, hospital, suspected, confirmed]

Potential Bridges Among Geographical Nodes
• Including geographical nodes helps reveal people who may act as bridges, carrying the disease from one subgroup to another.

Network Visualization (Cont.)
• For a hospital outbreak, including geographical nodes and contacts in the network also helps show the possible disease transmission scenario within the hospital.
• Background of the example
  - Mr. L, a laundry worker in H Hospital, had a fever on 2003/4/16 and was reported as a suspected SARS patient.
  - Nurse C took care of Mr. L on 4/16 and 4/17.
  - Nurse C and Ms. N, another laundry worker in H Hospital, began to have symptoms on 4/21.
  - H Hospital was reported to have a SARS outbreak on 4/24.
  - Nurse C's daughter had a fever on 5/1.

Phase Analysis – Density
• Normalized density and average degree show similar patterns:
  - In the importation phase, the foreign-country contact network increases dramatically in Week 4 (3/17-3/23), followed by the personal contact network.
  - In the hospital outbreak phase, both the personal and hospital networks increase dramatically. In Week 10, however, the personal network still increases while the hospital network decreases.
[Figure: two panels, Density and Average Degree, by week (2 to 10) for the foreign-country, hospital, and personal networks, with the importation and hospital outbreak phases marked]

Phase Analysis – Transferability
• Betweenness shows that the personal network does not have enough transferability until Week 9. In the importation phase it forms only several small fragments, without big groups.
• The number of components shows that the hospital network is the only one that consistently links patients together.
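
The phase analysis accumulates contact links week by week and tracks density, average degree, betweenness, and the number of components. The sketch below shows one way such a table could be computed with the Python standard library only. It rests on assumptions the slides do not confirm: density is taken as links divided by possible node pairs (the slides say "normalized density" without a formula), betweenness is summarized as the largest node betweenness divided by (n-1)(n-2)/2, isolated patients count as their own components, and the names `weekly_metrics`, `betweenness`, and `count_components` are hypothetical.

```python
from collections import deque


def count_components(nodes, adj):
    """Count connected components with breadth-first search."""
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u] - seen:
                seen.add(v)
                queue.append(v)
    return count


def betweenness(nodes, adj):
    """Unnormalized node betweenness (Brandes' algorithm, unweighted graph)."""
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        order, preds = [], {v: [] for v in nodes}
        paths = dict.fromkeys(nodes, 0)  # number of shortest paths from s
        paths[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        dep = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                dep[v] += paths[v] / paths[w] * (1 + dep[w])
            if w != s:
                score[w] += dep[w]
    # Each undirected pair was visited from both endpoints.
    return {v: b / 2 for v, b in score.items()}


def weekly_metrics(nodes, dated_edges, n_weeks):
    """dated_edges: list of (week, u, v) with week counted from 1.

    Links accumulate: week k contains every link dated week 1..k.
    """
    nodes = list(nodes)
    n = len(nodes)
    adj = {u: set() for u in nodes}
    rows = []
    for week in range(1, n_weeks + 1):
        for w, u, v in dated_edges:
            if w == week and u != v:
                adj[u].add(v)
                adj[v].add(u)
        links = sum(len(nbrs) for nbrs in adj.values()) // 2
        top = max(betweenness(nodes, adj).values()) if n > 2 else 0.0
        rows.append({
            "week": week,
            "density": links / (n * (n - 1) / 2) if n > 1 else 0.0,
            "avg_degree": 2 * links / n if n else 0.0,
            "max_betweenness": top / ((n - 1) * (n - 2) / 2) if n > 2 else 0.0,
            "components": count_components(nodes, adj),
        })
    return rows
```

For example, `weekly_metrics("ABCD", [(1, "A", "B"), (2, "B", "C")], 2)` returns one row per week; in the second row the two links form a chain through B, so B is the only node with non-zero betweenness. Running the function once per contact type (foreign country, hospital, personal) would give comparable curves for each network.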
[Figure: two panels, Betweenness and Number of Components, by week (2 to 10) for the foreign-country, hospital, and personal networks, with the importation and hospital outbreak phases marked]

Ongoing Research
• Worldwide infectious disease breaking news collection, monitoring, and analysis
• Markov-switching model based disease surveillance
• Infectious disease social network analysis and contact tracing
• Other public health concerns and infectious disease applications

Building Research Partnership
• Emerging critical medical and public health concerns
• Willing and engaging international domain (biomedical) partners and funding sources
• Data, data, and more data
• From academic research to scalable solutions/systems and lasting impacts

For more information:
BioPortal web site: http://www.bioportal.org
AI Lab web site: http://ai.arizona.edu
[email protected]