Download ibm_rochester_talk_may_2005

Data Mining for Finding Connections of Disease and Medical and Genomic Characteristics Vipin Kumar William Norris Professor and Head, Department of Computer Science [email protected] www.cs.umn.edu/~kumar Michael Steinbach Department of Computer Science Increasing Amounts of Medical and Genomic Data  Electronic medical records are becoming increasingly common – Automated analysis of patient information is now possible  Obtaining genomic information is increasingly affordable – SNPs offer the potential of tests for disease or susceptibility for disease © Vipin Kumar May 18, 2006 ‹#› Various Techniques Currently Used to Analyze the Relationship Between SNPs and Disease  Statistical Association Analysis – Chi-Square and Odds ratio  Logistic Regression  Multifactor Dimensionality Reduction – Attribute selection and construction followed by classification © Vipin Kumar May 18, 2006 ‹#› Challenges Facing Current Techniques  Statistical Association Analysis – Often lacks power, even for single SNP association – Combinatorial explosion when used for screening pairs, triples, etc.  Logistic Regression – Does not work well when many attributes  Multifactor Dimensionality Reduction – Impractical for many attributes due to combinatorial nature  General Challenges – Disease or susceptibility to disease often results from interactions among many genetic, phenotypic, and environmental factors – Noise and nonlinear interactions – The Challenges of Whole-Genome Approaches to Common Diseases, Moore and Ritchie, JAMA, 2004 © Vipin Kumar May 18, 2006 ‹#› General Approach Using Data Mining Techniques  Create a data set that records the presence and absence of – Phenotypic characteristics – Genetic characteristics (SNPs) – Disease  Apply association analysis to find groups of phenotypic and genetic characteristics that are highly associated with disease – Uses characteristics of the patterns to prune the search space  Clustering and classification can also be applied © Vipin Kumar May 18, 2006 ‹#› Traditional Association Analysis  Association analysis: Analyzes relationships among items (attributes) in a binary transaction data – Example data: market basket data – Data can be represented as a binary matrix Set-Based Representation of Data TID Items 1 Bread, Milk 2 3 4 5 Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke – Applications in business and science  Example: Milk  Diaper © Vipin Kumar May 18, 2006 Beer Eggs Coke – Association Rules: X  Y, where X and Y are itemsets. 1 2 3 4 5 Diapers – Itemsets: Collection of items  Example: {Milk, Diaper} Milk Two types of patterns Bread  1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 0 1 0 0 0 0 0 1 0 1 Binary Matrix Representation of Data ‹#› The Need for Error-Tolerant Itemsets  An error-tolerant itemset (ETI) can have a fraction  of the items missing in each transaction. Example: see the data in the table – Let  = 1/4. In other words, each transaction needs to have 3/4 (75%) of the items. – X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4.  Algorithms to find ETIs are still in development  You can think of these ETIs as blocks in the data matrix © Vipin Kumar May 18, 2006 ‹#› ETIs in For Finding Patterns in Phenotypic and Genomic Data  ETIs consist of – A set of patients and – A set of attributes such that – The block is relatively dense   These blocks identify sets of patients that are highly associated with certain sets of attributes and vice-versa If most of these patients share a disease, then these attributes (genetic and/or phenotypic) are candidate markers for the disease © Vipin Kumar May 18, 2006 X: Set of patients Y: Set of attributes, i.e., SNPs, medical characteristics ‹#› Example: Using ETIs to Find Rules   Can use ETIs to find better association rules Example: Mushroom data set – Classifies 8192 mushrooms as poisonous (3916) or not (4208) – 117 other attributes, such as color, odor, etc.  Comparison of rules based on frequent itemsets or ETIs – One rule is {29, 48, 90} → p  Individual correlations are 0.62, 0.54, 0.18  With traditional association analysis, this rule has a confidence of 1 and a support of 576. – This corresponds to a correlation of 0.18  If we require only two of the three items, we still have a confidence of 1, but the support is 3312. – This corresponds to a correlation of 0.86 – Maximum single item correlation is 0.78 © Vipin Kumar May 18, 2006 ‹#› Generalizing ETIs to Blocks in a Data Matrix  Dense blocks consist of – A set of objects and – A set of attributes such that – The block is relatively dense   These blocks identify sets of objects that are highly associated with certain sets of attributes and vice-versa For gene expression data, this identifies transcription modules – Group of genes that are coexpressed together under a set of conditions © Vipin Kumar May 18, 2006 X: Genes Y: Conditions under which genes are expressed Entries of matrix are the level of a gene’s expression under a condition ‹#› Techniques for Finding Relatively Dense Blocks in a Data Matrix  Algorithms for finding Error-Tolerant Itemsets – More work needed to develop algorithms – We are currently using support envelopes – Support Envelopes: A Technique for Exploring the Structure of Association Patterns, Steinbach, Tan, Kumar, KDD, 2004  Subspace clustering – Similar in spirit to association analysis – Subspace clustering for high dimensional data: A review, Parsons , Haque , Liu SIGKDD Explorations, 2004  Co-clustering – Information-Theoretic Co-Clustering, Dhillon, Mallela, Modha, KDD 2003 – We are currently exploring approaches to extend co-clustering for ETIs  Variety of other approaches – Matrix factorization – Graph-partitioning © Vipin Kumar May 18, 2006 ‹#› Support Envelopes • A support envelope contains all association patterns involving m or more transactions and n or more items • By association patterns we mean • Itemsets and variants (frequent, maximal, closed) • Error Tolerant Itemsets (ETIs) An example of a support envelope involving characteristics of mushrooms. One of the attributes is, ‘gill-color:buff’, which occurs in 1728 records, every one of which occurs with 13 other items (one of which is the attribute,‘poisonous’) © Vipin Kumar May 18, 2006 ‹#› Visualizing Support Envelopes for Mushroom  One of the support envelopes (576, 23) is denser than its surrounding neighbors. © Vipin Kumar May 18, 2006 ‹#› Using Weak Associations to Find Patterns  Apply the following principle: If most pairs of attributes or objects in a set have a pairwise connections, then there is likely to be a strong association among them even if the pairwise associations are weak.  The hyperclique pattern uses this principle – Hyperclique Pattern Discovery, Xiong Tan, and Vipin Kumar, to appear DMKD, 2006. – Good for removing noise   Enhancing Data Analysis with Noise Removal, Hui Xiong, Gaurav Pandey, Michael Steinbach, Vipin Kumar, TKDE, 2006 We have recently developed more general pairwise patterns – Custom Itemset Patterns, Steinbach and Kumar, submitted to KDD 2006 © Vipin Kumar May 18, 2006 ‹#› Application of Classification Techniques  Techniques must work with noisy, sparse, highdimensional data  Success of multifactor dimensionality reduction indicates the usefulness of attribute selection and attribute creation  ETI and related patterns offer an alternative for feature extraction  Classification based on association analysis  SVM with the proper kernel © Vipin Kumar May 18, 2006 ‹#›

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ibm_rochester_talk_may_2005