PCA, Clustering
and Classification
by Agnieszka S. Juncker
Part of the slides is adapted from Chris Workman
Motivation: Multidimensional data
Probe ID       Pat1  Pat2  Pat3  Pat4  Pat5  Pat6  Pat7  Pat8
209619_at      7758  4705  5342  7443  8747  4933  7950  5031
32541_at        280   387   392   238   385   329   337   163
206398_s_at    1050   835  1268  1723  1377   804  1846  1180
219281_at       391   593   298   265   491   517   334   387
207857_at      1425   977  2027  1184   939   814   658   593
211338_at        37    27    28    38    33    16    36    23
213539_at       124   197   454   116   162   113    97    97
221497_x_at     120    86   175    99   115    80    83   119
213958_at       179   225   449   174   185   203   186   185
210835_s_at     203   144   197   314   250   353   173   285
209199_s_at     758  1234   833  1449   769  1110   987   638
217979_at       570   563   972   796   869   494   673  1013
201015_s_at     533   343   325   270   691   460   563   321
203332_s_at     649   354   494   554   710   455   748   392
204670_x_at    5577  3216  5323  4423  5771  3374  4328  3515
208788_at       648   327  1057   746   541   270   361   774
210784_x_at     142   151   144   173   148   145   131   146
204319_s_at     298   172   200   298   196   104   144   110
205049_s_at    3294  1351  2080  2066  3726  1396  2244  2142
202114_at       833   674   733  1298   862   371   886   501
213792_s_at     646   375   370   436   738   497   546   406
203932_at      1977  1016  2436  1856  1917   822  1189  1092
203963_at        97    63    77   136    85    74    91    61
203978_at       315   279   221   260   227   222   232   141
203753_at      1468  1105   381  1154   980  1419  1253   554
204891_s_at      78    71   152    74   127    57    66   153
209365_s_at     472   519   365   349   756   528   637   828
209604_s_at     772    74   130   216   108   311    80   235
211005_at        49    58   129    70    56    77    61    61
219686_at       694   342   345   502   960   403   535   513
38521_at        775   604   305   563   542   543   725   587
217853_at       367   168   107   160   287   264   273   113
217028_at      4926  2667  3542  5163  4683  3281  4822  3978
201137_s_at    4733  2846  1834  5471  5079  2330  3345  1460
202284_s_at     600  1823  1657  1177   972  2303  1574  1731
201999_s_at     897   959   800   808   297  1014   998   663
Outline
• Dimension reduction
– PCA
– Clustering
• Classification
• Example: study of childhood leukemia
Childhood Leukemia
• Cancer in the cells of the immune system
• Approx. 35 new cases in Denmark every year
• 50 years ago – all patients died
• Today – approx. 78% are cured
• Risk groups
– Standard
– Intermediate
– High
– Very high
– Extra high
• Treatment
– Chemotherapy
– Bone marrow transplantation
– Radiation
Prognostic Factors
Factor                  Good prognosis        Poor prognosis
Immunophenotype         precursor B           T
Age                     1–9                   ≥10
Leukocyte count         Low (<50×10⁹/L)       High (>100×10⁹/L)
Number of chromosomes   Hyperdiploidy (>50)   Hypodiploidy (<46)
Translocations          t(12;21)              t(9;22), t(1;19)
Treatment response      Good response         Poor response
Study of Childhood Leukemia
• Diagnostic bone marrow samples from leukemia patients
• Platform: Affymetrix Focus Array
– 8793 human genes
• Immunophenotype
– 18 patients with precursor B immunophenotype
– 17 patients with T immunophenotype
• Outcome 5 years from diagnosis
– 11 patients with relapse
– 18 patients in complete remission
Principal Component Analysis
(PCA)
• used for visualization of complex data
• developed to capture as much of the variation in the data as possible
Principal components
• 1st principal component (PC1)
  – the direction along which there is greatest variation
• 2nd principal component (PC2)
  – the direction with maximum remaining variation, orthogonal to PC1
• Principal components in general (see the sketch below)
  – linear combinations of the original variables
  – uncorrelated with each other
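For illustration, a minimal Python sketch of PCA on an expression matrix (the slides name no software, so scikit-learn is an assumption here, and the random matrix is only a stand-in for real expression data):

    # Minimal PCA sketch: patients x genes matrix reduced to a few components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(34, 8793))        # stand-in for 34 patients x 8793 genes

    pca = PCA(n_components=10)
    scores = pca.fit_transform(X)          # patient coordinates, shape (34, 10)
    pc1, pc2 = scores[:, 0], scores[:, 1]  # used for the 2-D plots of patients

    # percent variance per component (cf. the variance bar chart below)
    print(pca.explained_variance_ratio_ * 100)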
Principal components
PCA - example
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients; dimension reduced from 8793 genes to 2
Outcome: PCA on all Genes
Principal components - Variance
[Bar chart: variance (%) explained by each of PC1–PC10; y-axis 0–25%]
Clustering methods
• Hierarchical
  – agglomerative (bottom-up), e.g. UPGMA
  – divisive (top-down)
• Partitioning
  – e.g. K-means clustering
Hierarchical clustering
• Representation of all pairwise distances
• Parameters: none (only the choice of distance measure)
• Results:
  – all items end up in one large cluster
  – hierarchical tree (dendrogram)
• Deterministic
Hierarchical clustering
– UPGMA Algorithm (see the sketch below)
• Assign each item to its own cluster
• Join the two nearest clusters
• Re-estimate the distances between clusters
• Repeat until all items are joined in a single cluster
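A minimal sketch of this algorithm in Python/SciPy (my choice of library, not the slides'); SciPy's "average" linkage is the UPGMA rule, and the random matrix stands in for real data:

    # UPGMA sketch: agglomerative clustering with average linkage (SciPy).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = rng.normal(size=(34, 50))        # stand-in: patients x selected genes

    d = pdist(X, metric="euclidean")     # all pairwise distances
    Z = linkage(d, method="average")     # method="average" corresponds to UPGMA
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters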
Hierarchical clustering
Hierarchical clustering
Data with clustering order and distances
Dendrogram representation
Leukemia data - clustering of patients
Leukemia data - clustering of genes
K-means clustering
• Partition data into K clusters
• Parameter: Number of clusters (K) must be chosen
• Randomized initialization:
  – may give different clusters each time
K-means - Algorithm
• Assign each item a class in 1 to K (randomly)
• For each class 1 to K
– Calculate the centroid (one of the K-means)
– Calculate distance from centroid to each item
• Assign each item to the nearest centroid
• Repeat until no items are re-assigned (convergence)
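A minimal K-means sketch with scikit-learn (an assumed library; its default initialization is k-means++ rather than the purely random assignment above, but the iteration is the same idea):

    # K-means sketch: K must be fixed in advance.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(34, 50))        # stand-in: patients x selected genes

    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = km.fit_predict(X)           # cluster index (0..K-1) per patient
    centroids = km.cluster_centers_      # the K mean vectors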
K-means clustering, K=3
K-means clustering of Leukemia data
Comparison of clustering methods
• Hierarchical clustering
  – Distances between all variables
  – Time-consuming with a large number of genes
  – Advantageous to cluster on selected genes
• K-means clustering
  – Faster algorithm
  – Does not show relations between all variables
Distance measures
• Euclidean distance
  d(x, y) = ( Σᵢ₌₁ᴺ (xᵢ − yᵢ)² )^½
• Vector angle distance
  d(x, y) = 1 − cos θ = 1 − ( Σᵢ xᵢ yᵢ ) / ( Σᵢ xᵢ² · Σᵢ yᵢ² )^½
• Pearson distance
  d(x, y) = 1 − CC = 1 − ( Σᵢ (xᵢ − x̄)(yᵢ − ȳ) ) / ( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )^½
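The same three measures written out in NumPy (a sketch of my own, not code from the slides; x and y are 1-D expression vectors):

    # Distance measures from the slide, as NumPy functions.
    import numpy as np

    def euclidean(x, y):
        return np.sqrt(np.sum((x - y) ** 2))

    def vector_angle(x, y):              # 1 - cos(theta)
        return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def pearson(x, y):                   # 1 - correlation coefficient
        xc, yc = x - x.mean(), y - y.mean()
        return 1.0 - np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))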
Comparison of distance measures
Classification
• Feature selection
• Classification methods
• Cross-validation
• Training and testing
Reduction of input features
• Dimension reduction
– PCA
• Feature selection (gene selection)
– Significant genes: t-test
– Selection of a limited number of genes
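A sketch of the t-test-based gene selection in Python/SciPy (the library and the cutoff of 50 genes are assumptions for illustration; the random data is a stand-in):

    # Gene selection sketch: rank genes by two-sample t-test p-value.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    X = rng.normal(size=(34, 8793))      # stand-in: patients x genes
    y = np.array([0] * 17 + [1] * 17)    # stand-in class labels (two groups)

    t, p = ttest_ind(X[y == 0], X[y == 1], axis=0)  # one test per gene
    top = np.argsort(p)[:50]             # indices of the 50 most significant genes
    X_selected = X[:, top]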
Microarray Data
Patient    Class       Gene1   Gene2  Gene3  Gene4  Gene5   Gene6  ...  Gene8793
Patient1   precursorB  1789.5  46.4   215.6  176.9  4023.0  12.6   ...  312.5
Patient2   T           963.9   52.0   276.4  504.6  3768.6  12.1   ...  415.9
Patient3   ...         2079.5  22.3   245.1  420.5  4257.8  37.7   ...  1045.4
Patient4   ...         3243.9  27.1   199.1  380.4  4451.8  38.7   ...  1308.0
...
Outcome: PCA on Selected Genes
Outcome Prediction:
CC against the Number of Genes
[Plot: correlation coefficient (CC), −0.1 to 0.9, against the number of top-ranking genes (0–140) for the nearest centroid classifier]
Nearest Centroid
• Calculation of a centroid for each class k (per gene i):
  x̄ᵢₖ = Σⱼ∈Cₖ xᵢⱼ / nₖ
• Calculation of the distance between a test sample and each class centroid
• Class prediction by the nearest centroid method (see the sketch below)
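A minimal nearest-centroid sketch in Python (my own illustration of the method described above, with Euclidean distance assumed):

    # Nearest-centroid sketch: per-class mean profiles, assign by distance.
    import numpy as np

    def fit_centroids(X, y):
        # x̄_k: mean expression of each gene over the training samples in class k
        return {k: X[y == k].mean(axis=0) for k in np.unique(y)}

    def predict(centroids, sample):
        # class whose centroid is nearest (Euclidean) to the test sample
        return min(centroids, key=lambda k: np.linalg.norm(sample - centroids[k]))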
K-Nearest Neighbor (KNN)
• Based on a distance measure
  – for example Euclidean distance
• Parameter k = number of nearest neighbors
  – k=1
  – k=3
  – k=...
• Prediction by majority vote among the k neighbors (odd k avoids ties)
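A minimal KNN sketch with scikit-learn (an assumed library; random stand-in data):

    # KNN sketch: Euclidean distance, majority vote, odd k.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(30, 50))          # stand-in training data
    y_train = rng.integers(0, 2, size=30)

    knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
    knn.fit(X_train, y_train)
    predictions = knn.predict(rng.normal(size=(4, 50)))  # classify new samples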
Neural Networks
[Diagram: feed-forward network. Input layer: one input neuron per selected gene (Gene2, Gene6, ..., Gene8793); hidden layer: hidden neurons; output layer: output neurons giving the predicted class (B, T, ...)]
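A minimal sketch of such a network with scikit-learn's MLPClassifier (the library, the single hidden layer of 5 neurons, and the stand-in data are my assumptions, not the authors' settings):

    # Feed-forward network sketch: selected genes in, class (B/T) out.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(30, 50))          # stand-in: patients x selected genes
    y_train = rng.integers(0, 2, size=30)        # 0 = B, 1 = T

    net = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
    net.fit(X_train, y_train)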
Comparison of Methods
Nearest centroid and KNN:
– Simple methods
– Based on distance calculation
– Good for simple problems
– Good for few training samples
– Distribution of data assumed

Neural networks:
– Advanced methods
– Involve machine learning
– Several adjustable parameters
– Many training samples required (e.g. 50-100)
– Flexible methods
Cross-validation
Data: 10 samples

5-fold cross-validation:
Training: 4/5 of data (8 samples)
Testing: 1/5 of data (2 samples)
-> 5 different models

Leave-one-out cross-validation (LOOCV):
Training: 9/10 of data (9 samples)
Testing: 1/10 of data (1 sample)
-> 10 different models
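Both splitting schemes in a short Python sketch (scikit-learn's splitters are an assumption; the 10 samples match the example above):

    # Cross-validation sketch: 5-fold CV and leave-one-out CV on 10 samples.
    import numpy as np
    from sklearn.model_selection import KFold, LeaveOneOut

    X = np.arange(10).reshape(10, 1)             # 10 samples, as on the slide

    for train_idx, test_idx in KFold(n_splits=5).split(X):
        ...                                      # 5 models: train on 8, test on 2

    for train_idx, test_idx in LeaveOneOut().split(X):
        ...                                      # 10 models: train on 9, test on 1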
Validation
• Definition of
– true and false positives
– true and false negatives
Actual class   Predicted class   Result
B              B                 TP
B              T                 FN
T              T                 TN
T              B                 FP
Accuracy
• Definition:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Range: 0 – 100%
Matthews correlation coefficient
• Definition:
MCC = (TP·TN − FN·FP) / √((TN+FN)(TN+FP)(TP+FN)(TP+FP))
• Range: −1 to 1
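Both measures as plain Python functions (a direct transcription of the two formulas above):

    # Accuracy and Matthews correlation coefficient from confusion counts.
    import math

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)

    def mcc(tp, tn, fp, fn):
        denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
        return (tp * tn - fn * fp) / denom if denom else 0.0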
Overview of Classification
Expression data
Subdivision of data for cross-validation into training sets and test sets
Feature selection (t-test)
Dimension reduction (PCA)
Training of classifier:
- using cross-validation
- choice of method
- choice of optimal parameters
Testing of classifier
Independent test set
Important Points
• Avoid overfitting
• Validate performance
  – by testing on an independent test set
  – by using cross-validation
• Include feature selection in the cross-validation (see the sketch below)
  Why?
  – To avoid overestimation of performance!
  – To make a general classifier
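One way to keep feature selection inside cross-validation is a scikit-learn Pipeline, so the gene ranking is recomputed on each training fold only (a sketch under assumed choices: SelectKBest with a per-gene F-test as the t-test analogue, a KNN classifier, and stand-in data):

    # Sketch: feature selection nested inside cross-validation via a Pipeline.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(34, 8793))              # stand-in expression data
    y = np.array([0] * 17 + [1] * 17)

    clf = Pipeline([
        ("select", SelectKBest(f_classif, k=50)),     # per-gene F-test ranking
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ])
    scores = cross_val_score(clf, X, y, cv=5)    # selection is redone per fold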
Study of Childhood Leukemia: Results
• Classification of immunophenotype (precursor B and T)
  – 100% accuracy
    • during training
    • when testing on an independent test set
– Simple classification methods applied
• K-nearest neighbor
• Nearest centroid
• Classification of outcome (relapse or remission)
– 78% accuracy (CC = 0.59)
– Simple and advanced classification methods applied
Risk classification in the future ?
Patient:
– Clinical data
– Immunophenotyping
– Morphology
– Genetic measurements
– Microarray technology

Prognostic factors:
– Immunophenotype
– Age
– Leukocyte count
– Number of chromosomes
– Translocations
– Treatment response

Risk group:
– Standard
– Intermediate
– High
– Very high
– Extra high

-> Custom designed treatment
Summary
• Dimension reduction important to visualize data
– Principal Component Analysis
– Clustering
• Hierarchical
• Partitioning (K-means)
(distance measure important)
• Classification
– Reduction of dimension often necessary (t-test, PCA)
– Several classification methods available
– Validation
Coffee break