Overview of Biomedical Informatics

Vipin Kumar
University of Minnesota
[email protected]
www.cs.umn.edu/~kumar

Team Members: Michael Steinbach, Rohit Gupta, Gowtham Atluri, Gang Fang, Gaurav Pandey, Sanjoy Dey, Vanja Paunic
Collaborators: Brian Van Ness, Bill Oetting, Gary L. Nelsestuen, Christine Wendt, Piet C. de Groen, Michael Wilson
Research Supported by NSF, IBM, BICB-UMR, Pfizer

Nov 12th, 2009
Understanding Biotechnology – The Science of the ‘Omics’

Biomedical Informatics

Recent technological advances are helping to generate large amounts of biomedical data:
• Data from high-throughput experimental techniques
  - Gene expression data
  - Biological networks
  - Proteomics and metabolomics data
  - Single Nucleotide Polymorphism (SNP) data
• Electronic Medical Records
  - IBM-Mayo Clinic partnership has created a DB of 5 million patients

Great potential benefits from the analysis of these large-scale data sets:
• Automated analysis of patient histories for customized treatment
• Discovery of biomarkers for complex diseases and other phenotypes
• Cheminformatics and drug discovery

Large-scale Data is Everywhere!

There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
• New mantra: Gather whatever data you can whenever and wherever possible.
• Expectations: Gathered data will have value either for the purpose collected or for a purpose not envisioned.
[Figure: example data sources: Homeland Security, Scientific Data, Geo-spatial data, Sensor Networks, Business Data, Computational Simulations]

Data Mining
• Automated techniques for analyzing large data sets.
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Predictive Modeling: Classification
• Find a model for the class attribute as a function of the values of the other attributes

Model for predicting credit worthiness

Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

Decision tree:
Employed?
  No: Credit Worthy = No
  Yes: Education?
    Graduate: Number of years > 3 yr: Yes; < 3 yr: No
    {High school, Undergrad}: Number of years > 7 yrs: Yes; < 7 yrs: No

Discovering biomarkers
• Gene Expression Data
  - Given: n labeled subjects, each with expression levels of p genes
  - Objective: build a predictive model to identify cancer subtypes
  - Classical study of cancer subtypes: Golub et al. (1999), identification of diagnostic genes [figure; axis label: Genes]
• SNP Data
  - Given: n labeled subjects, each with genotypes of p SNPs
  - Objective: build a model using genotypes to predict labels

           SNP 1  SNP 2  SNP 3  …  Class
Patient 1  AC     GT     AA     …  1
Patient 2  AA     GG     GG     …  0
…
Patient n  CC     GG     AG     …  1

Predicting short-term vs. long-term survivors among myeloma subjects
• 3404 SNPs (selected according to potential relevance to myeloma)
• Cases: 70 patients who survived less than 1 year
• Controls: 73 patients who survived longer than 3 years

Brian Van Ness et al., Genomic Variation in Myeloma: Design, content and initial application of the Bank On A Cure SNP Panel to detect associations with progression free survival, BMC Medicine, Volume 6, pp 26, 2008.
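As an illustration of the classification slide, the credit-worthiness decision tree can be written as a small function and applied to the four example records. This is a minimal sketch, not the authors' code: the names `Applicant` and `credit_worthy` are hypothetical, and because the slide does not say how exactly 3 or exactly 7 years should be classified, the sketch assumes those boundary values fall on the "No" side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Applicant:
    """One record from the credit-worthiness example (hypothetical type name)."""
    employed: bool
    education: str          # "Graduate", "High School", or "Undergrad"
    years_at_address: int


def credit_worthy(a: Applicant) -> bool:
    """Follow the decision tree from the slide, root to leaf."""
    # Root split: unemployed applicants are classified "No".
    if not a.employed:
        return False
    # Second split on education selects the years-at-address threshold:
    # 3 years for Graduate, 7 years for {High School, Undergrad}.
    threshold = 3 if a.education == "Graduate" else 7
    # Assumption: a value exactly equal to the threshold is classified "No".
    return a.years_at_address > threshold


# The four example records (Tid 1 to 4) from the slide.
rows = [
    Applicant(True, "Graduate", 5),
    Applicant(True, "High School", 2),
    Applicant(False, "Undergrad", 1),
    Applicant(True, "High School", 10),
]

predictions = [credit_worthy(r) for r in rows]
for tid, (row, pred) in enumerate(zip(rows, predictions), start=1):
    print(tid, row.education, row.years_at_address, "Yes" if pred else "No")
```

In practice such a tree would be learned from labeled data rather than written by hand; the hand-coded version only shows how a fitted model maps attribute values to a class label.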
[Figure: SNP data, controls vs. cases]

Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Applications:
  – Finding groups of similar genes or proteins based upon their expression profiles
  – Clustering of patients based on phenotypic and genotypic factors for efficient disease diagnosis
  – Market Segmentation
  – Document Clustering
[Figure courtesy of Michael Eisen; Michael Eisen et al., 1999]

Association Pattern Discovery
• Given a set of records, each of which contains some number of items from a given collection:
  – Produce dependency rules that predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

• Biological applications
  – Identifying functional modules in protein interaction networks
  – Identifying transcription modules in gene expression data
  – Identifying biological entities associated with disease phenotypes
    • Biomarker discovery from genomic data, e.g. gene expression, single-nucleotide polymorphism (SNP), metabolite data, etc.

Discovery of Discriminative Patterns from Lung Cancer Gene Expression Data
• 67 normal samples, 102 cancer patients, 8787 genes [Stearman et al. 2005], [Su et al. 2007], [Bhattacharjee et al. 2001]
• Visualization of a size-10 pattern using a new discriminative pattern finding technique
  – Enriched with the TNF/NFkB signaling pathway, which is well known to be related to lung cancer
  – P-value: 1.4 × 10^-5 (6/10 overlap with the pathway)

Gang Fang, Rui Kuang, Gaurav Pandey, Michael Steinbach, Chad L. Myers and Vipin Kumar, Subspace Differential Coexpression Analysis: Problem Definition and A General Approach, In the Proceedings of the 15th Pacific Symposium on Biocomputing (PSB), pp. 145-156, 2010.
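To make the association pattern discovery slide concrete, the two discovered rules can be checked against the five market-basket transactions. The sketch below uses support and confidence, the standard measures for association rules; the slide itself does not name them, and the helper names `support_count`, `support`, and `confidence` are hypothetical.

```python
# The five transactions from the Association Pattern Discovery slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support_count(both, transactions) / support_count(antecedent, transactions)


rules = [
    ({"Milk"}, {"Coke"}),
    ({"Diaper", "Milk"}, {"Beer"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support =", support(lhs | rhs, transactions),
          "confidence =", round(confidence(lhs, rhs, transactions), 3))
```

On this table, {Milk} --> {Coke} holds in 3 of the 4 transactions containing Milk, and {Diaper, Milk} --> {Beer} holds in 2 of the 3 transactions containing both Diaper and Milk. The same counting applies when the "items" are genes, SNPs, or metabolites and the "transactions" are subjects.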
Discriminative Metabolite Patterns from Liver Cirrhosis Data
• 41 alcoholic liver cirrhosis cases (rows 1-41), 19 controls (rows 42-60), 3610 metabolites
  – Data from Gary Nelsestuen et al.
• A sample group of five metabolites having very similar (in relative terms) intensity values in cases, but mostly absent in controls.
  – (a) The rank values (black is 10, white is 0)
  – (b) Original intensity values
[Figure: panels (a) and (b)]

Gaurav Pandey, Gowtham Atluri, Michael Steinbach, Chad L. Myers and Vipin Kumar, An Association Analysis Approach to Biclustering, Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 677-686, 2009.

Summary
• Data mining techniques hold great promise for data-driven hypothesis generation in the biomedical domain.
• Ample scope exists for the development and application of novel techniques for the analysis of different types of biomedical data.

For further information…
• Visit www.cs.umn.edu/~kumar/dmbio.
• Send email to [email protected].

Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2005.