PCA, Clustering and Classification
by Agnieszka S. Juncker
(Part of the slides is adapted from Chris Workman.)

Motivation: Multidimensional data

Probe set      Pat1   Pat2   Pat3   Pat4   Pat5   Pat6   Pat7   Pat8
209619_at      7758   4705   5342   7443   8747   4933   7950   5031
32541_at        280    387    392    238    385    329    337    163
206398_s_at    1050    835   1268   1723   1377    804   1846   1180
219281_at       391    593    298    265    491    517    334    387
... (table truncated; the slide lists 36 probe sets, 209619_at through 201999_s_at, for all eight patients)

Outline
• Dimension reduction
  – PCA
  – Clustering
• Classification
• Example: study of childhood leukemia

Childhood Leukemia
• Cancer in the cells of the immune system
• Approx. 35 new cases in Denmark every year
• 50 years ago, all patients died
• Today, approx. 78% are cured
• Risk groups:
  – Standard
  – Intermediate
  – High
  – Very high
  – Extra high
• Treatment:
  – Chemotherapy
  – Bone marrow transplantation
  – Radiation

Prognostic Factors

Factor                  Good prognosis        Poor prognosis
Immunophenotype         precursor B           T
Age                     1-9                   ≥10
Leukocyte count         Low (<50×10⁹/L)       High (>100×10⁹/L)
Number of chromosomes   Hyperdiploidy (>50)   Hypodiploidy (<46)
Translocations          t(12;21)              t(9;22), t(1;19)
Treatment response      Good response         Poor response

Study of Childhood Leukemia
• Diagnostic bone marrow samples from leukemia patients
• Platform: Affymetrix Focus Array with 8793 human genes
• Immunophenotype:
  – 18 patients with precursor B immunophenotype
  – 17 patients with T immunophenotype
• Outcome 5 years from diagnosis:
  – 11 patients with relapse
  – 18 patients in complete remission

Principal Component Analysis (PCA)
• Used for visualization of complex data
• Developed to capture as much of the variation in the data as possible

Principal components
• First principal component (PC1):
  – the direction along which there is the greatest variation
• Second principal component (PC2):
  – the direction with the maximum variation left in the data, orthogonal to PC1
• In general, principal components are:
  – linear combinations of the original variables
  – uncorrelated with each other
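To make this concrete, below is a minimal sketch of such a PCA in Python with scikit-learn. The matrix X is a random stand-in for a real patients-by-genes expression matrix, and all names are illustrative; this is not the code used in the study.

```python
# Minimal PCA sketch: project patients onto the first principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8793))   # stand-in for the expression matrix (patients x genes)

pca = PCA(n_components=10)
scores = pca.fit_transform(X)     # each patient as coordinates along PC1..PC10

pc1, pc2 = scores[:, 0], scores[:, 1]   # 2-D coordinates for a scatter plot of patients

# Variance (%) captured by each component, as in the bar plot that follows
print(np.round(100 * pca.explained_variance_ratio_, 1))
```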
Principal components
[Figure]

PCA - example
[Figure]

PCA on all Genes
• Leukemia data, precursor B and T
• Plot of 34 patients, with the dimension of 8793 genes reduced to 2

Outcome: PCA on all Genes
[Figure]

Principal components - Variance
[Bar plot: variance (%) explained by each of PC1-PC10]

Clustering methods
• Hierarchical:
  – agglomerative (bottom-up), e.g. UPGMA
  – divisive (top-down)
• Partitioning:
  – e.g. K-means clustering

Hierarchical clustering
• Representation of all pairwise distances
• Parameters: none (apart from the distance measure)
• Results:
  – in one large cluster
  – hierarchical tree (dendrogram)
• Deterministic

Hierarchical clustering - UPGMA Algorithm
• Assign each item to its own cluster
• Join the nearest clusters
• Re-estimate the distances between clusters
• Repeat until all items are joined in one cluster

Hierarchical clustering
[Figure: data with clustering order and distances; dendrogram representation]

Leukemia data - clustering of patients
[Figure]

Leukemia data - clustering of genes
[Figure]

K-means clustering
• Partition data into K clusters
• Parameter: the number of clusters (K) must be chosen
• Randomized initialization:
  – may give different clusters each time

K-means - Algorithm
• Assign each item a class in 1 to K (randomly)
• For each class 1 to K:
  – calculate the centroid (one of the K means)
  – calculate the distance from the centroid to each item
• Assign each item to the nearest centroid
• Repeat until no items are re-assigned (convergence)

K-means clustering, K=3
[Figure: K-means iterations shown over three slides]

K-means clustering of Leukemia data
[Figure]

Comparison of clustering methods
• Hierarchical clustering:
  – distances between all variables
  – time-consuming with a large number of genes
  – advantage when clustering on selected genes
• K-means clustering:
  – faster algorithm
  – does not show relations between all variables

Distance measures
• Euclidean distance:
  d(x, y) = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{1/2}
• Vector angle distance:
  d(x, y) = 1 - \cos\theta = 1 - \frac{\sum_i x_i y_i}{\left( \sum_i x_i^2 \sum_i y_i^2 \right)^{1/2}}
• Pearson's distance:
  d(x, y) = 1 - CC = 1 - \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left( \sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2 \right)^{1/2}}

Comparison of distance measures
[Figure]

Classification
• Feature selection
• Classification methods
• Cross-validation
• Training and testing

Reduction of input features
• Dimension reduction:
  – PCA
• Feature selection (gene selection):
  – significant genes: t-test
  – selection of a limited number of genes

Microarray Data

            Patient1      Patient2   Patient3   Patient4   ...
Class       precursor B   T          ...        ...        ...
Gene1       1789.5        963.9      2079.5     3243.9     ...
Gene2       46.4          52.0       22.3       27.1       ...
Gene3       215.6         276.4      245.1      199.1      ...
Gene4       176.9         504.6      420.5      380.4      ...
Gene5       4023.0        3768.6     4257.8     4451.8     ...
Gene6       12.6          12.1       37.7       38.7       ...
...         ...           ...        ...        ...        ...
Gene8793    312.5         415.9      1045.4     1308.0     ...

Outcome: PCA on Selected Genes
[Figure]

Outcome Prediction: CC against the Number of Genes
[Plot: correlation coefficient (CC), roughly -0.1 to 0.9, against the number of top-ranking genes (0-140) for the nearest centroid classifier]

Nearest Centroid
• Calculate a centroid for each class k:
  \bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k
• Calculate the distance between a test sample and each class centroid
• Predict the class of the nearest centroid

K-Nearest Neighbor (KNN)
• Based on a distance measure, for example the Euclidean distance
• Parameter k = number of nearest neighbors (k = 1, 3, ...)
• Prediction by majority vote; an odd k avoids ties

Neural Networks
[Diagram: feed-forward network with gene expression values (Gene2, Gene6, ..., Gene8793) as the input layer, a layer of hidden neurons, and an output layer with one neuron per class (B, T, ...)]
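The distance measures and the two simple classifiers above take only a few lines of code. Below is a minimal, illustrative Python sketch (not the study's implementation); X_train, y_train and x_test are hypothetical stand-ins for a real training matrix, its class labels, and a test sample.

```python
# Distance measures and two simple distance-based classifiers.
import numpy as np

def euclidean(x, y):
    # d(x, y) = (sum_i (x_i - y_i)^2)^(1/2)
    return np.sqrt(np.sum((x - y) ** 2))

def vector_angle(x, y):
    # d(x, y) = 1 - cos(theta)
    return 1 - np.dot(x, y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

def pearson(x, y):
    # d(x, y) = 1 - CC (Pearson correlation coefficient)
    return 1 - np.corrcoef(x, y)[0, 1]

def nearest_centroid(X_train, y_train, x_test, dist=euclidean):
    # One centroid per class: the mean of that class's training samples.
    classes = np.unique(y_train)
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in classes}
    return min(classes, key=lambda c: dist(x_test, centroids[c]))

def knn(X_train, y_train, x_test, k=3, dist=euclidean):
    # Majority vote among the k training samples nearest to x_test.
    d = np.array([dist(x_test, x) for x in X_train])
    nearest = y_train[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```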
Comparison of Methods

Nearest centroid and KNN:
• Simple methods
• Based on distance calculation
• Good for simple problems
• Good for few training samples
• A distribution of the data is assumed

Neural networks:
• Advanced, flexible methods
• Involve machine learning
• Several adjustable parameters
• Many training samples required (e.g. 50-100)

Cross-validation
Data: 10 samples
• 5-fold cross-validation:
  – Training: 4/5 of the data (8 samples)
  – Testing: 1/5 of the data (2 samples)
  – → 5 different models
• Leave-one-out cross-validation (LOOCV):
  – Training: 9/10 of the data (9 samples)
  – Testing: 1/10 of the data (1 sample)
  – → 10 different models

Validation
• Definition of:
  – true and false positives
  – true and false negatives

Actual class   Predicted class   Outcome
B              B                 TP
B              T                 FN
T              T                 TN
T              B                 FP

Accuracy
• Definition:
  Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
• Range: 0 - 100%

Matthews correlation coefficient
• Definition:
  CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TN+FN)(TN+FP)(TP+FN)(TP+FP)}}
• Range: -1 to 1

Overview of Classification
• Expression data
• Subdivision of the data into training sets and test sets for cross-validation
• Feature selection (t-test) / dimension reduction (PCA)
• Training of the classifier:
  – using cross-validation
  – choice of method
  – choice of optimal parameters
• Testing of the classifier on an independent test set

Important Points
• Avoid overfitting
• Validate performance:
  – test on an independent test set
  – use cross-validation
• Include feature selection in the cross-validation. Why?
  – to avoid overestimation of performance
  – to make a general classifier

Study of Childhood Leukemia: Results
• Classification of immunophenotype (precursor B and T):
  – 100% accuracy, both during training and when testing on an independent test set
  – simple classification methods applied: K-nearest neighbor and nearest centroid
• Classification of outcome (relapse or remission):
  – 78% accuracy (CC = 0.59)
  – simple and advanced classification methods applied

Risk classification in the future?
• Patient data: clinical data, immunophenotyping, morphology, genetic measurements, microarray technology
• Prognostic factors: immunophenotype, age, leukocyte count, number of chromosomes, translocations, treatment response
• Risk group: standard, intermediate, high, very high, extra high
• → custom-designed treatment

Summary
• Dimension reduction is important to visualize data:
  – Principal Component Analysis
  – Clustering:
    • hierarchical
    • partitioning (K-means)
  – (the choice of distance measure is important)
• Classification:
  – reduction of dimension is often necessary (t-test, PCA)
  – several classification methods are available
  – validation

Coffee break
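As a worked companion to the validation slides, here is a minimal Python sketch of leave-one-out cross-validation with the t-test feature selection performed inside each fold, scored by accuracy and the Matthews correlation coefficient. The data, class labels, and number of genes kept are stand-ins, and the scipy/scikit-learn calls are one possible realization, not the pipeline used in the study.

```python
# LOOCV with per-fold feature selection, scored by accuracy and MCC.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score, matthews_corrcoef

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 500))          # stand-in expression matrix (patients x genes)
y = np.array(["B"] * 18 + ["T"] * 17)   # stand-in class labels
n_genes = 50                            # number of top-ranking genes to keep

preds = []
for train_idx, test_idx in LeaveOneOut().split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # t-test between the classes on the training fold ONLY, so the
    # feature selection stays inside the cross-validation (see above)
    t, _ = ttest_ind(X_tr[y_tr == "B"], X_tr[y_tr == "T"], axis=0)
    top = np.argsort(-np.abs(t))[:n_genes]
    clf = NearestCentroid().fit(X_tr[:, top], y_tr)
    preds.append(clf.predict(X[test_idx][:, top])[0])

print("Accuracy:", accuracy_score(y, preds))
print("MCC:     ", matthews_corrcoef(y, preds))
```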