Novel Mechanistic Insights in Cardiovascular Health and Disease via a Text Mining Approach
NIH BD2K AHM, November 30th, 2016
Peipei Ping
Heart BD2K

Text Mining: From Unstructured Textual Data to Structured Networks
• PubMed holds more than 2.2 million cardiovascular-related articles published between 1809 and 2016, and it is estimated that a new publication appears every ~2.7 minutes (Lau et al., Circulation 2016).
• However, this mounting body of unstructured textual data is interconnected.
• The key to moving from big textual data to knowledge is structuring: transforming unstructured textual data into structured and interconnected relationships.
  • Mining phrases from massive textual data.
  • Entity recognition and typing.
  • Pattern extraction for biological relationship discovery.

Aims of the Study
• To conduct natural language processing and pattern learning on published textual databases covering 6 main groups of cardiovascular disease (CVD).
• To explore the application of phrase mining and network embedding to textual CVD data in order to extract relevant information and identify novel patterns or classifications.
• To facilitate predictive analytics, gain novel mechanistic insights, and support clinical decision making.

Text Mine Biomedical Corpora (diagram)
• Corpora: Scientific Articles, Medical Case Reports, EHR
• Pattern & Relationship Discovery: Proteins, Genes, Metabolites
• New Knowledge: Patient Phenotyping, Disease Exploration, Therapy Development
• Novel Mechanistic Insights & Clinical Decision Making

Methods
Technologies:
• SegPhrase+, a phrase-mining algorithm
• Large-scale Information Network Embedding (LINE)
Input:
• List of the top 250 proteins relevant to CVD
• 551,358 publications (1995-2016) in PubMed, selected by the MeSH terms and synonyms for each of the following CVD groups: Cerebrovascular Accidents (CVA), Cardiomyopathies (CM), Ischemic Heart Diseases (IHD), Arrhythmias, Valve Disease (VD), and Congenital Heart Disease (CHD)

Medical Subject Headings (MeSH)
• The National Library of Medicine (NLM)'s controlled vocabulary thesaurus
• Used to index articles from 5,400 of the world's leading biomedical journals for MEDLINE®/PubMed®
• Maintained and updated by MeSH Section staff
• Hierarchical structure with broad and specific terms. Example terms: Cardiovascular Diseases; Heart Diseases; Cardiac Arrhythmias; Sick Sinus Syndrome; Cardiac Sinus Arrest
https://www.nlm.nih.gov/mesh

Finding Scientific Manuscripts Using MeSH Terms

Text mine: CaseOLAP workflow
• Corpus: PubMed research articles, 1995-2016
• Six main CVD groups with their MeSH terms
• Names of 250 proteins highly relevant in CVD, with their synonyms
• Extract and screen CVD-related articles by MeSH terms
• Input the extracted articles and the list of proteins (with synonyms included as a unified string)
• Calculate the text-mining score of each protein-disease pair using CaseOLAP

Rank Calculation
• Integrity: the phrase is meaningful, understandable, and high-quality. int(p,c) is calculated by SegPhrase+ in preprocessing.
• Distinctiveness: the phrase has a relatively larger count in the extracted articles of one disease than in the extracted articles of the other five diseases.
• Popularity: the phrase has a larger total count in the extracted articles of that disease than other phrases.
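
As a rough illustration of how the three criteria above could be combined into one score per protein-disease pair, here is a minimal Python sketch. Only the cube-root combination follows the final-rank formula on the slide. The `popularity` and `distinctiveness` functions are simplified stand-ins with hypothetical names (they are not CaseOLAP's own definitions), the integrity value is passed in by hand rather than computed by SegPhrase+, and the mention counts are toy numbers, not study data.

```python
from math import log

# Toy mention counts per disease group (illustrative numbers, not study data).
COUNTS = {
    "CM": {"titin": 30, "tnf": 10},
    "IHD": {"titin": 5, "tnf": 40},
}

def popularity(counts, phrase):
    """Stand-in: log-scaled count of `phrase` relative to all phrases in one disease."""
    total = sum(log(1 + c) for c in counts.values())
    return log(1 + counts.get(phrase, 0)) / total if total else 0.0

def distinctiveness(counts_by_disease, phrase, disease):
    """Stand-in: share of the phrase's mentions that fall in `disease`."""
    total = sum(c.get(phrase, 0) for c in counts_by_disease.values())
    return counts_by_disease[disease].get(phrase, 0) / total if total else 0.0

def final_rank(integrity, pop, dist):
    """Cube root of the product of the three criteria (their geometric mean)."""
    return (integrity * pop * dist) ** (1.0 / 3.0)

def score(counts_by_disease, phrase, disease, integrity=1.0):
    """Score one phrase for one disease; integrity defaults to 1.0 (assumed)."""
    pop = popularity(counts_by_disease[disease], phrase)
    dist = distinctiveness(counts_by_disease, phrase, disease)
    return final_rank(integrity, pop, dist)

if __name__ == "__main__":
    # Rank the toy protein names within each disease group, best first.
    for disease in COUNTS:
        ranked = sorted(COUNTS[disease],
                        key=lambda p: score(COUNTS, p, disease),
                        reverse=True)
        print(disease, ranked)
```

Because the three factors are multiplied, a phrase that scores zero on any one criterion gets a final rank of zero, which is the practical effect of using a geometric rather than an arithmetic mean.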

$$\text{final rank} = \sqrt[3]{\text{integrity} \times \text{popularity} \times \text{distinctiveness}}$$

Rank list of proteins in each CVD

Rank Calculation
• Integrity: the name is a meaningful, understandable, and high-quality phrase.
• Popularity: a large total count in the extracted articles of that disease.
• Distinctiveness: a relatively larger count in the extracted articles of that disease than in the extracted articles of the other five diseases.

Example:
Seed pair: <Breast Cancer, brca1>
Query: <Cardiomyopathy, ?>

Protein            Score
Interferon-γ       3.336
Interleukin-4      2.809
Interleukin-17a    2.729
TNF                2.549
Titin              2.349

Total List of Proteins in CVD According to Their Scores
250 molecules with their scores in CVD: https://dx.doi.org/10.6084/m9.figshare.4055886.v1
[Figure; axis label: Count]

Text-Mining Reveals Novel Biomedical Insights & New Patterns Among Key Proteins and 6 CVDs
Top 25 Scoring Proteins in 6 CVDs
[Figure; axis label: Score]

Biological Functions and Pathways of the Top 50 CaseOLAP Scoring Proteins over 6 CVD Groups
• Rank list of 250 proteins in each cardiovascular disease by text-mining score
• Input the GeneID/UniProt ID of the top 50 scoring proteins into the Reactome Analysis Pipeline
• Assess results for the 21 main biological processes
• Obtain the P-value and FDR of the overrepresentation test for the 21 largest biological processes

Values are P-values, with the parenthesized number as shown on the slide.

Biological Process                          AR          CHD         CM          CVA         IHD         VD
Developmental Biology                       0.043 (14)  0.081 (12)  0.180 (10)  0.253 (11)  0.063 (12)  0.054 (13)
Hemostasis                                  0.000 (20)  0.000 (13)  0.007 (10)  0.000 (15)  0.000 (21)  0.001 (13)
Neuronal System                             0.302 (4)   0.462 (3)   0.208 (4)   0.944 (1)   0.700 (2)   0.487 (3)
Signal Transduction                         0.503 (21)  0.037 (26)  0.084 (23)  0.021 (30)  0.012 (27)  0.142 (24)
Immune System                               0.592 (18)  0.992 (9)   0.007 (25)  0.047 (26)  0.038 (23)  0.476 (18)
Disease                                     0.400 (10)  0.056 (13)  0.128 (11)  0.819 (7)   0.243 (10)  0.213 (11)
DNA Repair                                  na (0)      0.884 (1)   na (0)      na (0)      na (0)      0.894 (1)
Chromatin organization                      na (0)      0.826 (1)   na (0)      na (0)      na (0)      na (0)
Metabolism                                  0.743 (15)  0.182 (19)  0.874 (11)  0.475 (18)  0.002 (26)  0.897 (12)
DNA Replication                             0.582 (1)   0.185 (2)   0.521 (1)   0.223 (2)   0.531 (1)   0.559 (1)
Transmembrane transport of small molecules  0.304 (7)   0.373 (6)   0.177 (7)   0.924 (3)   0.101 (8)   0.408 (6)
Gene Expression                             0.118 (19)  0.314 (15)  0.878 (9)   0.053 (21)  0.991 (6)   0.119 (18)
Cell Cycle                                  0.869 (3)   0.273 (6)   0.921 (2)   0.879 (3)   0.799 (3)   0.946 (2)
Extracellular matrix organization           0.031 (6)   0.020 (6)   0.000 (9)   0.003 (8)   0.004 (7)   0.000 (13)
Cellular responses to stress                0.660 (3)   0.011 (8)   0.024 (7)   0.132 (6)   0.073 (6)   0.961 (1)
Programmed Cell Death                       0.755 (1)   0.720 (1)   0.331 (2)   0.420 (2)   0.343 (2)   0.734 (1)

Rows whose column alignment did not survive in the transcript (labels and values in transcript order):
• Circadian Clock: AR 0.422 (1), CHD 0.391 (1); then 0.108 (2), 0.379 (1), 0.402 (1)
• One unassigned value after the Disease row: na (0)
• Unlabeled values after the Cell Cycle row: na (0), 0.925 (1), na (0), na (0), 0.917 (1), 0.932 (1), 0.169 (3), 0.000 (8), 0.049 (10), 0.726 (5)
• Organelle biogenesis and maintenance; Muscle contraction: 0.000 (10), 0.004 (6), 0.000 (7), na (0)
• Vesicle-mediated transport: 0.306 (8), 0.694 (5), 0.797 (4), 0.900 (4), na (0), 0.638 (1), na (0), na (0), na (0)
• Cell-Cell communication: 0.652 (1)

Summary
• We have devised a natural language processing approach to annotate vast amounts of textual data from published manuscripts on CVD for statistical pattern learning, extraction of relevant information, and application of predictive analytics.
• A combination of phrase-mining algorithms and a large-scale network embedding technique is effective for recognizing patterns and extracting relevant information, providing novel biomedical insights into the relationships among 25 proteins and 6 major CVDs.
• This novel data acquisition strategy may also be suitable for the vast amount of accumulated patient information in Clinical Case Reports and Electronic Health Records.
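
The pathway workflow described earlier reports a P-value and an FDR for each biological process. The sketch below shows one common way such values are computed, assuming a one-sided hypergeometric overrepresentation test followed by Benjamini-Hochberg adjustment; the deck does not state which statistics were applied, so this is an assumption. The function names and the example numbers are hypothetical.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k pathway members when n proteins
    are drawn from a background of N, of which K belong to the pathway."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

if __name__ == "__main__":
    # Hypothetical setup: 50 top-scoring proteins, background of 10,000 proteins.
    # Each tuple is (pathway size in background, members among the 50).
    pathways = {"A": (500, 10), "B": (200, 1), "C": (1000, 6)}
    pvals = [hypergeom_pvalue(k, 50, K, 10_000) for K, k in pathways.values()]
    for name, p, q in zip(pathways, pvals, benjamini_hochberg(pvals)):
        print(f"{name}: P={p:.3g} FDR={q:.3g}")
```

With 21 processes tested per disease group, the adjustment matters: an unadjusted P-value just under 0.05 can rise well above that threshold once the number of tests is taken into account.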

Acknowledgements
University of Illinois at Urbana-Champaign:
• Professor Jia Wei Han
• Doris Xin
• Meng Qu
• Xuan Wang
• Fangbo Tao
• Po-Wei Chan
University of California, Los Angeles:
• David Liem
• Vincent Kyi
• Leah Briscoe
• Travis Cao
• Brian Bleakley