Mining Big Data in Health Care
Vipin Kumar, Department of Computer Science, University of Minnesota
[email protected], www.cs.umn.edu/~kumar

Introduction

Mining Big Data: Motivation
• Today's digital society has seen enormous data growth in both commercial and scientific databases
• Data mining is becoming a commonly used tool to extract information from large and complex datasets
– Helps provide better customer service in business/commercial settings
– Helps scientists in hypothesis formation
• Example domains: homeland security, scientific data, geo-spatial data, sensor networks, business data, computational simulations

Data Mining for Life and Health Sciences
• Recent technological advances are helping to generate large amounts of both medical and genomic data
– High-throughput experiments/techniques: gene and protein sequences, gene-expression data, biological networks, phylogenetic profiles
– Electronic medical records: an IBM–Mayo Clinic partnership has created a database of 5 million patients; single nucleotide polymorphism (SNP) data
• Data mining offers potential solutions for the analysis of large-scale data
– Automated analysis of patient histories for customized treatment
– Prediction of the functions of anonymous genes
– Identification of putative binding sites in protein structures for drug/chemical discovery
– Example: protein interaction networks

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data

Data Mining as Part of the Knowledge Discovery Process

Data Mining Tasks...
Example data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes

Predictive Modeling: Classification

Predicting Survival using SNPs
• Given a SNP data set of myeloma patients, build a classification model that differentiates cases from controls.
– 3404 SNPs selected from various regions of the chromosome
– 70 cases (patients who survived shorter than 1 year)
– 73 controls (patients who survived longer than 3 years)

Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Predicting functions of proteins
• Classifying credit card transactions as legitimate or fraudulent
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace

General Approach for Building a Classification Model

Training set (Tid, Employed, Level of Education, # years at present address, Credit Worthy):
1  Yes  Graduate     5   Yes
2  Yes  High School  2   No
3  No   Undergrad    1   No
4  Yes  High School  10  Yes
…  …    …            …   …

Test set (class label unknown):
1  Yes  Undergrad    7  ?
2  No   Graduate     3  ?
3  Yes  High School  2  ?
…  …    …            …  …

Learn a classifier (model) from the training set; apply it to the test set.
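The train-then-apply loop above can be sketched in a few lines. The records and the one-attribute "decision stump" learner here are illustrative stand-ins for the slides' workflow, not the deck's actual method:

```python
from collections import Counter, defaultdict

# Toy training records, loosely following the slides' credit-worthiness table.
training = [
    {"Employed": "Yes", "CreditWorthy": "Yes"},
    {"Employed": "Yes", "CreditWorthy": "No"},
    {"Employed": "No",  "CreditWorthy": "No"},
    {"Employed": "Yes", "CreditWorthy": "Yes"},
    {"Employed": "No",  "CreditWorthy": "No"},
]

def learn_stump(records, attr, label):
    """For each value of `attr`, predict the majority class of `label`."""
    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[attr]][r[label]] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}

model = learn_stump(training, "Employed", "CreditWorthy")
print(model)         # {'Yes': 'Yes', 'No': 'No'}
print(model["Yes"])  # classify a test record with Employed = Yes -> 'Yes'
```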
Commonly Used Classification Models
• Base classifiers
– Decision tree based methods
– Rule-based methods
– Nearest-neighbor
– Neural networks
– Naïve Bayes and Bayesian belief networks
– Support vector machines
• Ensemble classifiers
– Boosting, bagging, random forests

Classification Model: Decision Tree
Model for predicting credit worthiness:
– Employed = No → Not Worthy
– Employed = Yes, Education = Graduate → Worthy
– Employed = Yes, Education in {High School, Undergrad}, number of years > 7 → Worthy; ≤ 7 → Not Worthy

Constructing a Decision Tree
Training data (Tid, Employed, Level of Education, # years at present address, Credit Worthy):
1   Yes  Graduate     5   Yes
2   Yes  High School  2   No
3   No   Undergrad    1   No
4   Yes  High School  10  Yes
5   Yes  Graduate     2   No
6   No   High School  2   No
7   Yes  Undergrad    3   No
8   Yes  Graduate     8   Yes
9   Yes  High School  4   Yes
10  No   Graduate     1   No

Key computation — class counts under candidate splits:
• Employed = Yes: Worthy 4, Not Worthy 3; Employed = No: Worthy 0, Not Worthy 3
• Education = Graduate: Worthy 2, Not Worthy 2; Education in {High School, Undergrad}: Worthy 2, Not Worthy 4

Splitting on Employed partitions the records into the Employed = Yes subset (Tids 1, 2, 4, 5, 7, 8, 9) and the Employed = No subset (Tids 3, 6, 10); each subset is then split recursively.

Design Issues of Decision Tree Induction
• How should training records be split?
– Method for specifying the test condition, depending on attribute types
– Measure for evaluating the goodness of a test condition
• How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination

How to Determine the Best Split
• Greedy approach: nodes with purer class distribution are preferred
• Need a measure of node impurity:
– C0: 5, C1: 5 — high degree of impurity
– C0: 9, C1: 1 — low degree of impurity

Measure of Impurity: GINI
• Gini index for a given node t:
  GINI(t) = 1 − Σ_j [p(j|t)]²
  where p(j|t) is the relative frequency of class j at node t.
– Maximum (1 − 1/n_c) when records are equally distributed among all n_c classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information
– For a 2-class problem with class frequencies (p, 1 − p):
  GINI = 1 − p² − (1 − p)² = 2p(1 − p)

Computing Gini Index of a Single Node
• C1: 0, C2: 6 → P(C1) = 0, P(C2) = 1; Gini = 1 − 0² − 1² = 0.000
• C1: 1, C2: 5 → Gini = 1 − (1/6)² − (5/6)² = 0.278
• C1: 2, C2: 4 → Gini = 1 − (2/6)² − (4/6)² = 0.444
• C1: 3, C2: 3 → Gini = 0.500

Computing Gini Index for a Collection of Nodes
• When a node p with n records is split into k partitions (children):
  GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i)
  where n_i is the number of records at child i.
• Choose the attribute that minimizes the weighted average Gini index of the children
• The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT

Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought
• Example: parent node split on attribute B
– Parent: C1 = 7, C2 = 5; Gini = 0.486
– Node N1 (B = Yes): C1 = 5, C2 = 1; Gini(N1) = 1 − (5/6)² − (1/6)² = 0.278
– Node N2 (B = No): C1 = 2, C2 = 4; Gini(N2) = 1 − (2/6)² − (4/6)² = 0.444
– Weighted Gini of N1 and N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
– Gain = 0.486 − 0.361 = 0.125

Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value v
• Several choices for the splitting value: number of possible splitting values = number of distinct values
• Each splitting value has a count matrix associated with it: class counts in each of the partitions A < v and A ≥ v
• Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.
• Example (defaulted-borrower table above): splitting on Annual Income > 80K gives
  ≤ 80K: Defaulted 0, Not Defaulted 3; > 80K: Defaulted 3, Not Defaulted 4

Decision Tree Based Classification
• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)
• Disadvantages:
– The space of possible decision trees is exponentially large; greedy approaches are often unable to find the best tree
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute

Handling Interactions
• Example: 1000 "+" instances and 1000 "o" instances, with Entropy(X) = 0.99 and Entropy(Y) = 0.99
• Adding Z as a noisy attribute generated from a uniform distribution gives Entropy(Z) = 0.98, so attribute Z will be chosen for splitting!
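The node-impurity and split computations above can be checked in a few lines; this is a minimal sketch using the class counts from the slides' examples:

```python
def gini(counts):
    """Gini impurity of a node: GINI(t) = 1 - sum_j p(j|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """Weighted average Gini of the child nodes of a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Single-node examples from the slides:
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444

# Binary split on attribute B: parent [7, 5], children N1 = [5, 1], N2 = [2, 4]
split = gini_split([[5, 1], [2, 4]])
print(round(split, 3))                 # 0.361
print(round(gini([7, 5]) - split, 3))  # gain: 0.125
```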
Limitations of Single Attribute-Based Decision Boundaries
• Example: both positive (+) and negative (o) classes generated from skewed Gaussians with centers at (8,8) and (12,12) respectively.

Model Overfitting

Classification Errors
• Training errors (apparent errors): errors committed on the training set
• Test errors: errors committed on the test set
• Generalization errors: expected error of a model over a random selection of records from the same distribution

Example Data Set
Two-class problem:
• + : 5200 instances — 5000 generated from a Gaussian centered at (10,10), plus 200 noisy instances
• o : 5200 instances generated from a uniform distribution
• 10% of the data used for training and 90% used for testing

Increasing Number of Nodes in Decision Trees
• (Figures: decision boundaries on the training data for a decision tree with 4 nodes vs. a decision tree with 50 nodes)
• Which tree is better?
Model Overfitting
• Underfitting: when the model is too simple, both training and test errors are large
• Overfitting: when the model is too complex, training error is small but test error is large
• Using twice the number of data instances:
– If the training data is under-representative, test errors increase and training errors decrease as the number of nodes increases
– Increasing the size of the training data reduces the difference between training and test errors at a given number of nodes

Reasons for Model Overfitting
• Lack of representative samples
• Model is too complex — multiple comparisons

Effect of Multiple Comparison Procedure
• Consider the task of predicting whether the stock market will rise or fall on each of the next 10 trading days (e.g., Day 1: Up, Day 2: Down, Day 3: Down, …, Day 10: Down)
• Random guessing: P(correct) = 0.5
• Make 10 random guesses in a row:
  P(# correct ≥ 8) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = 0.0547
• Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the largest number of correct predictions
• Probability that at least one analyst makes at least 8 correct predictions:
  P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 = 0.9399
• Many algorithms employ the following greedy strategy:
– Initial model: M
– Alternative model: M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
– Keep M' if the improvement Δ(M, M') > α
• Often γ is chosen from a set of alternative components Γ = {γ1, γ2, …, γk}
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting

Effect of Multiple Comparison — Example
• Use an additional 100 noisy variables generated from a uniform distribution along with X and Y as attributes.
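The two probabilities in the multiple-comparison example above can be reproduced directly:

```python
from math import comb

# P(at least 8 of 10 random guesses are correct), guessing with p = 0.5
p_one = sum(comb(10, k) for k in range(8, 11)) / 2**10
print(round(p_one, 4))  # 0.0547

# P(at least one of 50 analysts gets at least 8 correct)
p_any = 1 - (1 - p_one) ** 50
print(round(p_any, 4))  # 0.9399
```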
• Use 30% of the data for training and 70% of the data for testing; compare against using only X and Y as attributes

Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Training error does not provide a good estimate of how well the tree will perform on previously unseen records
• Need ways of incorporating model complexity into model development

Evaluating Performance of a Classifier
• Model selection
– Performed during model building
– Purpose is to ensure that the model is not overly complex (to avoid overfitting)
• Model evaluation
– Performed after the model has been constructed
– Purpose is to estimate the performance of the classifier on previously unseen data (e.g., a test set)

Methods for Classifier Evaluation
• Holdout: reserve k% for training and (100 − k)% for testing
• Random subsampling: repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k − 1 partitions, test on the remaining one
– Leave-one-out: k = n
• Bootstrap: sampling with replacement
– .632 bootstrap:
  acc_boot = (1/b) Σ_{i=1}^{b} (0.632 × acc_i + 0.368 × acc_s)

Application on Biomedical Data

Application: SNP Association Study
• Given: a patient data set that has genetic variations (SNPs) and their associated phenotype (disease).
• Objective: find a combination of genetic characteristics that best defines the phenotype under study.
• Data: genetic variation in patients (SNPs, one row per patient, one 0/1 column per SNP) as a binary matrix, and survival/disease (yes/no) as the class label.
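The .632 bootstrap formula from the classifier-evaluation slide above can be sketched as follows; the accuracy values are hypothetical, purely for illustration:

```python
def bootstrap_632(boot_accs, acc_resub):
    """.632 bootstrap estimate: combine the accuracy acc_i measured on each
    bootstrap round's held-out records with the resubstitution accuracy
    acc_s of a model trained and tested on the full data."""
    b = len(boot_accs)
    return sum(0.632 * a + 0.368 * acc_resub for a in boot_accs) / b

# Hypothetical accuracies from b = 4 bootstrap rounds, resubstitution accuracy 0.95
print(round(bootstrap_632([0.70, 0.72, 0.68, 0.71], 0.95), 4))  # 0.7936
```

The 0.368 weight pulls the (usually pessimistic) out-of-bag accuracies toward the (usually optimistic) resubstitution accuracy.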
SNP (Single Nucleotide Polymorphism)
• Definition of SNP (Wikipedia): a SNP is a single base change in a DNA sequence that occurs in a significant proportion (more than 1 percent) of a large population
• Example: individuals 1, 2, 4, and 5 carry AGCGTGATCGAGGCTA, while individual 3 carries AGCGTGAGCGAGGCTA (a single T→G change)
• Each SNP has 3 genotype values (e.g., GG / GT / TT, i.e., mm / Mm / MM)
• How many SNPs are in the human genome? About 10,000,000

Why are SNPs interesting?
• In human beings, 99.9 percent of bases are the same.
• The remaining 0.1 percent makes a person unique.
– Different attributes / characteristics / traits: how a person looks, diseases a person develops
• These variations can be:
– Harmless (change in phenotype)
– Harmful (diabetes, cancer, heart disease, Huntington's disease, and hemophilia)
– Latent (variations found in coding and regulatory regions that are not harmful on their own; the change in each gene only becomes apparent under certain conditions, e.g., susceptibility to lung cancer)

Issues in SNP Association Study
• In disease association studies the number of SNPs varies from a small number (targeted study) to a million (GWA studies)
• The number of samples is usually small
• Data sets may have noise or missing values
• Phenotype definition is not trivial (e.g., the definition of survival)
• Environmental exposure, food habits, etc. add more variability even among individuals defined under the same phenotype
• Genetic heterogeneity among individuals for the same phenotype

Existing Analysis Methods
• Univariate analysis: each single SNP is tested against the phenotype for correlation and ranked.
– Feasible, but doesn't capture the existing true combinations.
• Multivariate analysis: groups of SNPs of size two or more are tested for possible association with the phenotype.
– Infeasible, but captures any true combinations.
• These two approaches are used to identify biomarkers.
• Some approaches employ classification methods like SVMs to classify cases and controls.

Discovering SNP Biomarkers
• Given a SNP data set of myeloma patients, find a combination of SNPs that best predicts survival.
– 3404 SNPs selected from various regions of the chromosome
– 70 cases (patients who survived shorter than 1 year)
– 73 controls (patients who survived longer than 3 years)
• Complexity of the problem:
– Large number of SNPs (over a million in GWA studies) and small sample size
– Complex interaction among genes may be responsible for the phenotype
– Genetic heterogeneity among individuals sharing the same phenotype (due to environmental exposure, food habits, etc.) adds more variability
– Complex phenotype definition (e.g., survival)

Odds Ratio
• Measures whether two groups have the same odds of an event:
– OR = 1: odds of the event are equal in both groups
– OR > 1: odds of the event are higher in cases
– OR < 1: odds of the event are higher in controls
• The odds ratio is invariant to row and column scaling
• For a 2×2 contingency table over a biomarker (SNP), with CASE counts a (has marker) and b (lacks marker) and Control counts c (has marker) and d (lacks marker):
  odds_ratio = (a/b) / (c/d) = ad / bc
P-value
• Statistical terminology for a probability value: the probability of getting an odds ratio as extreme as the observed one by random chance
• Computed using the chi-square statistic or Fisher's exact test
– The chi-square statistic is not valid if the number of entries in a cell of the contingency table is small
– Fisher's exact test is a statistical test for nonrandom associations between two categorical variables
• Using Fisher's exact test, when testing whether the value is higher than expected by random chance:
  p-value = 1 − hygecdf(a − 1, a+b+c+d, a+c, a+b)
• P-values are often expressed as the negative log of the p-value, e.g., −log10(0.005) = 2.3
Ranking candidate biomarkers involves both measures: a SNP may have the highest −log10(p-value) with a moderate odds ratio, the highest odds ratio with a moderate p-value, or moderate values of both.

Example: High −log10(p-value), Moderate Odds Ratio
• CASE: has marker a = 40, lacks marker b = 30; Control: has marker c = 19, lacks marker d = 54
• Odds ratio = ad/bc = (40 × 54) / (30 × 19) ≈ 3.79
• p-value = 1 − hygecdf(39, 143, 59, 70); −log10(p-value) = 3.85

Example: Highest Odds Ratio, Moderate −log10(p-value)
• a = 7, b = 63, c = 1, d = 72
• Odds ratio = (7 × 72) / (63 × 1) = 8
• p-value = 1 − hygecdf(6, 143, 8, 70); −log10(p-value) = 1.56

Example (same table scaled ×10)
• a = 70, b = 630, c = 10, d = 720
• Odds ratio = (70 × 720) / (630 × 10) = 8
• p-value = 1 − hygecdf(69, 1430, 80, 700); −log10(p-value) = 6.56

Example (same table scaled ×20)
• a = 140, b = 1260, c = 20, d = 1440
• Odds ratio = (140 × 1440) / (1260 × 20) = 8
• p-value = 1 − hygecdf(139, 2860, 160, 1400); −log10(p-value) = 11.9
• Note: with the same odds ratio, larger samples give far more significant p-values.

Issues with Traditional Methods
• Each SNP is tested and ranked individually
• Top-ranked SNP: −log10(p-value) = 3.8; odds ratio = 3.7
• Individual SNP associations with the true phenotype are not distinguishable from a random permutation of the phenotype (Van Ness et al 2009)
• However, most reported associations are not robust: of the 166 putative associations which have been studied three or more times, only 6 have been consistently replicated.
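The odds ratios and Fisher's-exact p-values in the examples above can be computed with the standard library; the hypergeometric upper tail below plays the role of the slides' 1 − hygecdf(…) expression:

```python
from math import comb, log10

def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d) = ad / bc for a 2x2 case/control table."""
    return (a * d) / (b * c)

def fisher_p(a, b, c, d):
    """One-sided p-value P(X >= a) with X ~ Hypergeometric(N, K, n),
    equivalent to 1 - hygecdf(a-1, a+b+c+d, a+c, a+b) on the slides."""
    N, K, n = a + b + c + d, a + c, a + b
    denom = comb(N, n)
    return sum(comb(K, k) * comb(N - K, n - k)
               for k in range(a, min(K, n) + 1)) / denom

print(odds_ratio(7, 63, 1, 72))                  # 8.0
print(round(-log10(fisher_p(7, 63, 1, 72)), 2))  # 1.56, as on the slide
```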
Evaluating the Utility of Univariate Rankings for Myeloma Data
• Biased evaluation: feature selection on the full data first, then leave-one-out cross validation with SVM
• Clean evaluation: leave-one-out cross validation with SVM, with feature selection performed inside each fold

Random Permutation Test
• 10,000 random permutations of the real phenotype are generated
• For each one, leave-one-out cross validation using SVM
• Accuracies larger than 65% are highly significant (p-value < 10⁻⁴)

Clustering
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized; inter-cluster distances are maximized

Applications of Clustering
• Gene expression clustering (courtesy: Michael Eisen)
• Clustering of patients based on phenotypic and genotypic factors for efficient disease diagnosis
• Market segmentation
• Document clustering
• Finding groups of driver behaviors based upon patterns of automobile motion (normal, drunken, sleepy, rush-hour driving, etc.)

Notion of a Cluster can be Ambiguous
• How many clusters? The same points can plausibly be grouped into two, four, or six clusters.

Similarity and Dissimilarity Measures
• Similarity measure
– Numerical measure of how alike two data objects are
– Higher when objects are more alike
– Often falls in the range [0,1]
• Dissimilarity measure
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0; the upper limit varies
• Proximity refers to either a similarity or a dissimilarity

Euclidean Distance
  dist(x, y) = sqrt( Σ_{k=1}^{n} (x_k − y_k)² )
  where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.
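The Euclidean distance formula above translates directly into code; a minimal sketch:

```python
from math import sqrt

def euclidean(x, y):
    """dist(x, y) = sqrt(sum over the n attributes of (x_k - y_k)^2)."""
    return sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(euclidean([0, 0], [3, 4]))  # 5.0, the classic 3-4-5 triangle
```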
• Correlation
  corr(x, y) = Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ) / sqrt( Σ_{k=1}^{n} (x_k − x̄)² · Σ_{k=1}^{n} (y_k − ȳ)² ) = cov(x, y) / (std(x) · std(y))

Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and partitional sets of clusters
• Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree

Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters
– Can represent multiple classes or 'border' points
• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1; the weights must sum to 1
– Probabilistic clustering has similar characteristics
• Partial versus complete
– In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities

Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• Other types of clustering

K-means Clustering
• Partitional clustering approach
• Number of clusters, K, must be specified
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The basic algorithm is very simple
• (Figure: example of K-means converging over iterations 1–6)

K-means Clustering — Details
• The centroid is (typically) the mean of the points in the cluster
• Initial centroids are often chosen randomly; the clusters produced vary from one run to another
• 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
• Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Evaluating K-means Clusters
• The most common measure is the
Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster centroid
– To get SSE, we square these errors and sum them:
  SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist²(m_i, x)
  where x is a data point in cluster C_i and m_i is the representative point (centroid) for cluster C_i
– Given two sets of clusters, we prefer the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters

Two Different K-means Clusterings
• (Figure: the same original points yield an optimal clustering in one run and a sub-optimal clustering in another, depending on the initial centroids)

Limitations of K-means
• K-means has problems when clusters have differing sizes, differing densities, or non-globular shapes
• K-means has problems when the data contains outliers
• (Figures: original points vs. K-means clusters for each of these cases)

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits

Strengths of Hierarchical Clustering
• Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies
– Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there
are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time

Agglomerative Clustering Algorithm
• The more popular hierarchical clustering technique
• The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
• The key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms

Starting Situation
• Start with clusters of individual points and a proximity matrix

Intermediate Situation
• After some merging steps, we have some clusters (C1–C5) and a proximity matrix between them
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix

After Merging
• The question is: how do we update the proximity matrix for the merged cluster C2 ∪ C5?

How to Define Inter-Cluster Distance (Similarity)
• MIN
• MAX
• Group average
• Distance between centroids
• Other methods driven by an objective function
– Ward's method uses squared error

Other Types of Cluster Algorithms
• Hundreds of clustering algorithms exist; some examples:
– K-means
– Hierarchical
– Statistically based clustering algorithms (e.g., mixture model based clustering)
– Fuzzy clustering
– Self-organizing maps (SOM)
– Density-based (DBSCAN)
• The proper choice of algorithm depends on the type of clusters to be found, the type of data, and the objective

Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the "goodness" of the resulting clusters
• But "clusters are in the eye of the beholder"!
• Then why do we want to evaluate them?
– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters

Clusters Found in Random Data
• (Figure: the same random points clustered by DBSCAN, K-means, and complete link — each algorithm imposes clusters even on random data)

Different Aspects of Cluster Validation
• Distinguishing whether non-random structure actually exists in the data
• Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
• Evaluating how well the results of a cluster analysis fit the data without reference to external information
• Comparing the results of two different sets of cluster analyses to determine which is better
• Determining the 'correct' number of clusters

Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually.
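Reordering a similarity matrix by cluster label can be sketched as follows; the toy 1-D points, labels, and the simple similarity function are illustrative, not from the slides:

```python
# Sketch: order points by cluster label, then build the similarity matrix.
# Good clusters show up as high-similarity blocks on the diagonal.
points = [0.1, 0.2, 0.15, 0.9, 0.95, 1.0]
labels = [0, 0, 0, 1, 1, 1]  # assumed cluster assignments

# Reorder point indices so members of each cluster are adjacent.
order = sorted(range(len(points)), key=lambda i: labels[i])

# Toy similarity: 1 minus absolute difference (points lie in [0, 1]).
sim = [[1.0 - abs(points[i] - points[j]) for j in order] for i in order]

within = sim[0][1]   # two points from cluster 0
between = sim[0][3]  # one point from each cluster
print(within > between)  # True: the diagonal block is "brighter"
```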
• (Figure: points with well-separated clusters show a sharp block-diagonal structure in the reordered similarity matrix)
• Clusters in random data are not so crisp (shown for DBSCAN, K-means, and complete link)

Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of cluster validity are classified into three types of indices:
– External index: used to measure the extent to which cluster labels match externally supplied class labels (e.g., entropy)
– Internal index: used to measure the goodness of a clustering structure without respect to external information (e.g., sum of squared error, SSE)
– Relative index: used to compare two different clusterings or clusters
• Often an external or internal index is used for this function, e.g., SSE or entropy
• For further details please see "Introduction to Data Mining", Chapter 8.
– http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf

Clustering Microarray Data
• Microarray analysis allows the monitoring of the activities of many genes over many different conditions
• Data: expression profiles of approximately 3,606 genes of E. coli, recorded over 30 experimental conditions
• The SAM (Significance Analysis of Microarrays) package from Stanford University is used to analyze the data and identify genes that are substantially differentially upregulated in the dataset
– 17 such genes are identified for study purposes
• Hierarchical clustering is performed and plotted using TreeView
[Figure: gene-by-condition expression matrix (Gene1…Gene7 × C1…C7) and the resulting TreeView heat map.]

CLUTO for Clustering Microarray Data
• CLUTO (Clustering Toolkit), George Karypis (University of Minnesota): http://glaros.dtc.umn.edu/gkhome/views/cluto/
• CLUTO can also be used for clustering microarray data

Issues in Clustering Expression Data
• Similarity uses all the conditions
– We are typically interested in sets of genes that are similar for a relatively small set of conditions
• Most clustering approaches assume that an object can belong to only one cluster
– A gene may belong to more than one functional group
– Thus, overlapping groups are needed
• Can either use clustering that takes these factors into account, or use other techniques
– For example, association analysis

Clustering Packages
• Mathematical and statistical packages
– MATLAB
– SAS
– SPSS
– R
• CLUTO (Clustering Toolkit), George Karypis (UM): http://glaros.dtc.umn.edu/gkhome/views/cluto/
• Cluster, Michael Eisen (LBNL/UCB) (microarray): http://rana.lbl.gov/EisenSoftware.htm
• http://genome-www5.stanford.edu/resources/restech.shtml (more microarray clustering algorithms)
• Many others – KDnuggets: http://www.kdnuggets.com/software/clustering.html

Association Analysis
[Figure: gene expression patterns in normal and cancer patients.]

Association Analysis
• Given a set of records, find dependency rules that
will predict the occurrence of an item based on the occurrences of other items in the record

TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke} (s=0.6, c=0.75)
{Diaper, Milk} --> {Beer} (s=0.4, c=0.67)

Support: s = (# transactions that contain X and Y) / (total # transactions)
Confidence: c = (# transactions that contain X and Y) / (# transactions that contain X)

• Applications
– Marketing and sales promotion
– Supermarket shelf management
– Traffic pattern analysis (e.g., rules such as "high congestion on Intersection 58 implies high accident rates for left-turning traffic")

Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach: two steps
– Frequent itemset generation
• Generate all itemsets whose support ≥ minsup
– Rule generation
• Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is computationally expensive

Efficient Pruning Strategy (Ref: Agrawal & Srikant, 1994)
• If an itemset is infrequent, then all of its supersets must also be infrequent
[Figure: itemset lattice over {A, B, C, D, E}; once an itemset is found to be infrequent, all of its supersets are pruned.]

Illustrating Apriori Principle
Minimum support count = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets) – no need to generate candidates involving Coke or Eggs:
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

Triplets (3-itemsets):
Itemset | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Association Measures
• Association measures evaluate the strength of an
association pattern
– Support and confidence are the most commonly used
– The support, σ(X), of an itemset X is the number of transactions that contain all the items of the itemset
• Frequent itemsets have support > a specified threshold
• Different types of itemset patterns are distinguished by a measure and a threshold
– The confidence of an association rule is given by conf(X → Y) = σ(X ∪ Y) / σ(X)
• An estimate of the conditional probability of Y given X
• Other measures can be more useful
– H-confidence
– Interest

Application to Biomedical Data Mining: Differential Coexpression (DC)
• Differential expression vs. differential coexpression

Differential Expression (DE)
– Traditional analysis targets changes in expression level
[Figure: expression level over samples in controls and cases; a differentially expressed gene shifts its level between the two groups.]
[Golub et al., 1999], [Pan, 2002], [Cui and Churchill, 2003], etc.

Differential Coexpression (DC)
– Targets changes in the coherence of expression
[Figure: matrix of expression values over samples (genes × samples) in controls and cases.]
• Question: Is this gene interesting, i.e., associated with the phenotype?
• Answer: No, in terms of differential expression (DE). However, what if there are another two genes ……? Yes! [Kostka & Spang, 2005]
• Biological interpretations of DC: dysregulation of pathways, mutation of transcription factors, etc.
[Silva et al., 1995], [Li, 2002], [Kostka & Spang, 2005], [Rosemary et al., 2008], [Cho et al., 2009], etc.

Differential Coexpression (DC)
• Existing work on differential coexpression
– Pairs of genes with differential coexpression
• [Silva et al., 1995], [Li, 2002], [Li et al., 2003], [Lai et al., 2004]
– Clustering-based differential coexpression analysis
• [Ihmels et al., 2005], [Watson, 2006]
– Network-based analysis of differential coexpression
• [Zhang and Horvath, 2005], [Choi et al., 2005], [Gargalovic et al., 2006], [Oldham et al.
2006], [Fuller et al., 2007], [Xu et al., 2008]
– Beyond pair-wise (size-k) differential coexpression
• [Kostka and Spang, 2004], [Prieto et al., 2006]
– Gene-pathway differential coexpression
• [Rosemary et al., 2008]
– Pathway-pathway differential coexpression
• [Cho et al., 2009]

Existing DC Work Is "Full-Space"
• Full-space differential coexpression uses full-space measures, e.g., correlation difference
• May have limitations due to the heterogeneity of
– Causes of a disease (e.g., genetic difference)
– Populations affected (e.g., demographic difference)
• Motivation: such subspace patterns may be missed by full-space models

Extension to Subspace Differential Coexpression
• Definition of a subspace differential coexpression (SDC) pattern
– A set of k genes {g1, g2, …, gk}
– The fraction of samples in class A on which the k genes are coexpressed
– The fraction of samples in class B on which the k genes are coexpressed
• Problem: given n genes, find all subsets of genes such that SDC ≥ d, where SDC serves as the measure of subspace differential coexpression
• Details in [Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]

Computational Challenge
• Problem: given n genes, find all subsets of genes such that SDC ≥ d
• Given n genes, there are 2^n candidate SDC patterns!
[Figure: lattice of all subsets of {A, B, C, D, E}, illustrating the exponential candidate space.]
• How to effectively handle the combinatorial search space?
• Similar motivation and challenge as biclustering, but here: differential biclustering!
Direct Mining of Differential Patterns
• Refined SDC measure: "direct"
• A measure M is antimonotonic if for all A, B: A ⊆ B implies M(A) ≥ M(B)
• Details in [Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010] and [Fang, Pandey, Gupta, Steinbach and Kumar, IEEE TKDE 2011]

An Association-Analysis Approach
• Refined SDC measure
[Figure: itemset lattice over {A, B, C, D, E}; once an itemset is disqualified, all of its supersets are pruned [Agrawal et al., 1994].]
• Advantages: 1) systematic and direct; 2) completeness; 3) efficiency

A 10-Gene Subspace DC Pattern
• Enriched with the TNF-α/NFkB signaling pathway (6/10 overlap with the pathway, P-value: 1.4×10^-5)
[Figure: the pattern is coexpressed in ≈ 60% of samples in one class but only ≈ 10% in the other; enriched Ingenuity subnetwork, www.ingenuity.com.]
• Suggests that dysregulation of the TNF-α/NFkB pathway may be related to lung cancer

Data Mining Book
• For further details and sample chapters, see www.cs.umn.edu/~kumar/dmbook

Case Study 1: Neuroimaging

Mining Neuroimaging Data
• fMRI data
– Activity at each location is captured
– Each location is a 2 mm cube (voxel)
– Time resolution: 2 seconds
– Subject can be 'resting' or 'working' on a task
– http://www.youtube.com/watch?v=2kgn7jFt1vs
• Alternative representation
[Figure: locations × time matrix; each voxel corresponds to a time series.]

Functional Magnetic Resonance Imaging
• One can study fMRI data to discover
– The brain's operating principles at 'rest' [Heuvel et al., 2010]
– The brain's response to external stimuli [Mitchell et al., 2008]
– Differences between healthy and diseased subjects [Atluri et al., 2013]

Healthy vs. Disease – disconnectivity?
[Figure: pipeline from fMRI scan to atlas regions to a regions × regions correlation matrix to a brain network; schizophrenia example, Lynall et al., 2010.]

Healthy vs.
Disease – univariate testing
[Figure: subjects × edges matrices for healthy and diseased groups, tested edge by edge.]
● Significantly discriminating connections discovered from multiple datasets are found to be inconsistent (Pettersson-Yeo et al., 2011)
● Are they similar when discovered from the same subjects at different times?

Healthy vs. Disease – inconsistency
[Figure: scatter plot of -log(p) in Scan 1 vs. -log(p) in Scan 2; the significance regions for Scan 1, Scan 2, and Scans 1 & 2 barely overlap.]

Potential reasons for lack of consistency
• Noise + weak signal + small sample + large # of connections
• Dynamic connectivity

On the bright side... there is similarity in connections
[Figure: subjects × edges matrix with edges grouped into connectivity clusters (cluster1, cluster2, cluster3, …).]

Connectivity Cluster Analysis
Strengths:
• Handles redundancy, small number of tests, statistical power
• Handles noise by averaging connectivity across multiple edges
• Could potentially increase reliability
• Test significance at the cluster level

Healthy vs. Disease – consistent results (Atluri et al., HBM 2014)
• Consistent T1 vs. T2
• Hyper-connectivity in SZ
• Thalamic connectivity
• By leveraging the structure in the data, a weak but consistent signal can be captured!

Case Study 2: Home Health Care

Problem
• In the United States, 2010:
– 4.9 million people required help to complete ADLs (activities of daily living)
– 9.1 million people were unable to complete IADLs (instrumental activities of daily living) [1]
• Home Healthcare (HHC)
– Spending increased from $2.4 billion in 1980 to $17.7 billion today
– Improved mobility reported in 46.9% of adults before discharge from HHC [2]
• Mobility is one component of functional status
– Mobility affects functional status and functional disability
– Less than one-third of older adults recover pre-hospital function [3]
– Increased risk of falls in the home, rehospitalization, disability, social isolation, loss of independence
– Besides physical issues, also psychosocial issues, comorbidity and death
1.
Adams, Martinez, Vickerie, and Kirzinger, 2011; 2. Agency for Healthcare Research and Quality, 2012; 3. Chen, Wang, and Huang, 2008

Purpose of Study
• To discover patient and support-system characteristics associated with improved mobility outcomes
• To find new factors associated with mobility beyond the existing knowledge (current ambulation status during admission)
• In each subgroup of patients defined by current ambulation status during admission (1-5)
– We started with group 2 and then compared the observations with the other groups
• To compare the predictors across patient subgroups, to find the biomarkers consistent across all subgroups and the factors specific to each subgroup

Overall Steps
• OASIS data extracted from EHRs for 270,634 patients served by 581 Medicare-certified home healthcare agencies
• Standardizing data, de-identifying patients, imputing missing values, binarizing data into 98 variables
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, pp. 37-54. http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf, p.
41

Comparison of the Single Variables
• The odds ratio (OR) differs significantly among the groups defined by mobility score during admission

Data Mining Techniques
• We found that the risk variables significantly associated with the mobility outcome vary among the groups
• Group the single predictors based on whether they cover the same or different patient groups
– Clustering
• Based on similarity of the sample space
• Not discriminative
• High-frequency variables got merged
– Pattern-mining-based approach
• Discriminative
• Coherent (similarity of the sample space)
[Figure: clustering of variables into groups A-I.]

Patterns Associated with Improvement in Group 2
• Healthier physiological and psychosocial elderly
• Older adults with no problems in daily activities

Patterns Associated with No Improvement in Group 2
• Unable to toilet and transfer
• Paid help
• Feeble groups with functional deficiency
• Cognitive deficits and behavioral problems
• Help with financial and legal matters

Patterns Associated with Mobility in Group 1
Patterns Associated with Mobility in Group 3
Patterns Associated with Mobility in Group 4

Discussion
• High prevalence of mobility limitations in HHC patients (97%)
• Mobility status at admission is the strongest predictor of improvement
– CMS outcome reporting controls for this, but doesn't look at differences by mobility status
• Variation of predictors within subgroups
• Different clusters point to the need to tailor interventions for subgroups

Discussion (Cont.)
• Single variables may be less helpful than patterns of variables – higher categories
• Large national sample – but not random
– There may be bias in the results
• Length of stay may vary and contribute to the findings
• Results are knowledge discovery, not hypothesis testing
• In the future, integrate diagnosis codes (ICD-9) and nursing interventions to combine factors related to mobility

References

Book
• Computational Approaches for Protein Function Prediction, Gaurav Pandey, Vipin Kumar and Michael Steinbach, to be published by John Wiley and Sons in the Book Series on Bioinformatics, Fall 2007

Conferences/Workshops
• Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, Gaurav Pandey, Michael Steinbach, Rohit Gupta, Tushar Garg and Vipin Kumar, to appear, ACM SIGKDD 2007
• Incorporating Functional Inter-relationships into Algorithms for Protein Function Prediction, Gaurav Pandey and Vipin Kumar, to appear, ISMB satellite meeting on Automated Function Prediction 2007
• Comparative Study of Various Genomic Data Sets for Protein Function Prediction and Enhancements Using Association Analysis, Rohit Gupta, Tushar Garg, Gaurav Pandey, Michael Steinbach and Vipin Kumar, Proc. Workshop on Data Mining for Biomedical Informatics, held in conjunction with the SIAM International Conference on Data Mining, 2007
• Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery, Hui Xiong, X. He, Chris Ding, Ya Zhang, Vipin Kumar and Stephen R. Holbrook, pp. 221-232, Proc. of the Pacific Symposium on Biocomputing, 2005
• Feature Mining for Prediction of Degree of Liver Fibrosis, Benjamin Mayer, Huzefa Rangwala, Rohit Gupta, Jaideep Srivastava, George Karypis, Vipin Kumar and Piet de Groen, Proc.
Annual Symposium of the American Medical Informatics Association (AMIA), 2005

Technical Reports
• Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study, Gaurav Pandey, Michael Steinbach, Rohit Gupta, Tushar Garg, Vipin Kumar, Technical Report 07-007, March 2007, Department of Computer Science, University of Minnesota
• Computational Approaches for Protein Function Prediction: A Survey, Gaurav Pandey, Vipin Kumar, Michael Steinbach, Technical Report 06-028, October 2006, Department of Computer Science, University of Minnesota