Correspondence analysis applied to microarray data
• Kurt Fellenberg, Nicole C. Hauser, Benedikt Brors, Albert Neutzner, Jörg D. Hoheisel, Martin Vingron
• http://www.dkfz-heidelberg.de/funct_genome/PDF-Files/PNAS-98(2001)-10781.pdf
• www.pnas.org/cgi/doi/10.1073/pnas.181597298
Principal Component Analysis
• Given N data vectors in k dimensions, find c ≤ k orthogonal vectors that best represent the data
– The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
• Each data vector is a linear combination of the c principal component vectors
• Project onto the subspace that preserves the most of the data variability:
$$ I = \frac{1}{n-1} \sum_{i=1}^{n} d(x_i, \bar{x})^2, \qquad d(x_i, \bar{x})^2 = \sum_{j=1}^{d} (x_{ij} - \bar{x}_j)^2 $$
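As a minimal sketch (not from the original slides), this projection can be computed from the eigendecomposition of the covariance matrix; all names below are illustrative:

import numpy as np

def pca(X, c):
    # Center each of the k dimensions.
    Xc = X - X.mean(axis=0)
    # k x k covariance matrix (the 1/(n-1) normalization above).
    cov = np.cov(Xc, rowvar=False)
    # Eigenvectors of the covariance matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:c]
    components = eigvecs[:, order]          # c orthogonal k-vectors
    scores = Xc @ components                # N vectors on c components
    explained = eigvals[order] / eigvals.sum()
    return scores, components, explained

# Example: 100 vectors in 5 dimensions reduced to c = 2 components.
X = np.random.default_rng(0).normal(size=(100, 5))
scores, components, explained = pca(X, c=2)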
Correspondence analysis
CA = PCA for categorical variables
Example: Dataset X, 27 dog breeds described by 7 categorical variables
(Function: C = Company, H = Hunt, U = Utility)

Name          Height   Weight   Speed    Intelligence  Affection  Aggressivity  Function
              - + ++   - + ++   - + ++   - + ++        - +        - +           C H U
1. Boxer      0 1 0    0 1 0    0 1 0    0 1 0         0 1        0 1           1 0 0
…
27. Caniche   1 0 0    1 0 0    0 1 0    0 0 1         0 1        1 0           1 0 0
= X
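The 0/1 columns above are a standard indicator (dummy) coding. A minimal sketch of building such a matrix with pandas; the two-variable frame below is illustrative, not the full dataset:

import pandas as pd

# Two of the seven categorical variables, for two of the 27 breeds.
dogs = pd.DataFrame(
    {"Height": ["+", "-"], "Function": ["C", "C"]},
    index=["Boxer", "Caniche"],
)
X = pd.get_dummies(dogs)   # one indicator column per category value
print(X)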
CA studies the dependence between two categorical variables
– here: Height and Function
It works on the crosstable (contingency table) N:

Height/Function    C    H    U |
-                  6    1    0 |  7
+                  3    2    0 |  5
++                 1    6    8 | 15
-------------------------------------
marginals         10    9    8 | 27

In general:

     n11 n12 n13 | n1+
N =  n21 n22 n23 | n2+
     n31 n32 n33 | n3+
     ------------------
     n+1 n+2 n+3 | n     (marginals)
Correspondence analysis
• Divide each row by its total:

Height/Function    C     H     U
-                 6/7   1/7   0/7  |  7
+                 3/5   2/5   0/5  |  5
++                1/15  6/15  8/15 | 15

In general:

n11/n1+  n12/n1+  n13/n1+
n21/n2+  n22/n2+  n23/n2+
n31/n3+  n32/n3+  n33/n3+
Each row becomes a point in the probability space over the categories of the second variable
(the conditional distributions given the category value of the first variable).
# points = # categories of the first variable = m
dimension of the space = # categories of the second variable = l
Distance between points (probabilities): a weighted Euclidean distance, low under independence
(one can transform the data and work with the usual Euclidean distance)
$$ d(i,j)^2 = \sum_{h=1}^{l} \frac{n}{n_{+h}} \left( \frac{n_{ih}}{n_{i+}} - \frac{n_{jh}}{n_{j+}} \right)^2 $$
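A minimal numpy sketch of this distance, using the Height × Function crosstable from above:

import numpy as np

N = np.array([[6, 1, 0],
              [3, 2, 0],
              [1, 6, 8]])                       # Height x Function
n = N.sum()
profiles = N / N.sum(axis=1, keepdims=True)     # row profiles n_ih / n_i+
col_masses = N.sum(axis=0) / n                  # n_+h / n

def chi2_distance(i, j):
    # d(i,j)^2 = sum_h (n / n_+h) * (n_ih/n_i+ - n_jh/n_j+)^2
    diff = profiles[i] - profiles[j]
    return np.sqrt(np.sum(diff ** 2 / col_masses))

print(chi2_distance(0, 1))   # distance between the '-' and '+' profiles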
Correspondence analysis
• Each row i is considered with weight n_{i+}/n
• The measure of total variability, I², is the chi-square statistic divided by n, a measure of the dependence between the two variables:
$$ I^2 = \sum_{i=1}^{m} \frac{n_{i+}}{n} \, d(i,g)^2 = \sum_{i=1}^{m} \sum_{h=1}^{l} \frac{n_{i+}}{n} \cdot \frac{n}{n_{+h}} \left( \frac{n_{ih}}{n_{i+}} - \frac{n_{+h}}{n} \right)^2 = \frac{1}{n} \sum_{i=1}^{m} \sum_{h=1}^{l} \frac{\left( n_{ih} - \frac{n_{i+} n_{+h}}{n} \right)^2}{\frac{n_{i+} n_{+h}}{n}} = \frac{\chi^2}{n} $$

where g is the centroid, i.e. the average row profile (n_{+1}/n, …, n_{+l}/n).
CA visualizes the cells that contribute most to the dependence: if some n_ih has an outstanding value, then both row i and column h will lie far from g, in the same direction.
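The identity I² = χ²/n can be checked numerically; a short sketch on the same crosstable:

import numpy as np

N = np.array([[6, 1, 0],
              [3, 2, 0],
              [1, 6, 8]])
n = N.sum()
# Expected counts under independence: n_i+ * n_+h / n.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = np.sum((N - expected) ** 2 / expected)   # chi-square statistic
inertia = chi2 / n                              # total variability I^2
print(chi2, inertia)                            # about 16.35 and 0.61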
Correspondence analysis
Dimension reduction: project (in the chi² norm) onto the subspace that preserves the most of the variability (dependence).
New variables = linear combinations of the initial ones.
As in PCA, the solutions come in terms of eigenvalues/eigenvectors of N:
– each eigenvalue gives the proportion of variability preserved by its axis
– there are measures of how well each point is represented in the subspace
– and measures of the contribution of each point/category to determining the optimal subspace (the subspace "meaning")
Height/Function    C     H     U   |       CA1    CA2    (row profiles n_i1/n_i+  n_i2/n_i+  n_i3/n_i+)
-                 6/7   1/7   0/7  |  7    1.10   -.92
+                 3/5   2/5   0/5  |  5    0.85   1.93
++                1/15  6/15  8/15 | 15   -0.84   0.02

Close CA1 values (for points well represented in the subspace) mean similar categories; this makes it easy to visualize and identify similar categories of the first variable (Height) in the low-dimensional plot (the - and + heights have similar Function profiles).
If two "identical" categories are joined, the chi² distances do not change.
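A minimal sketch of the eigendecomposition behind these coordinates, via the SVD of the standardized residuals (one common formulation; its scaling need not reproduce the exact values in the table above):

import numpy as np

N = np.array([[6, 1, 0],
              [3, 2, 0],
              [1, 6, 8]], dtype=float)
P = N / N.sum()                        # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)    # row and column masses
# Standardized residuals S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S)
# Row principal coordinates on the first two axes.
row_coords = U[:, :2] * sv[:2] / np.sqrt(r)[:, None]
# Squared singular values give the variability preserved per axis.
print(sv ** 2 / (sv ** 2).sum())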
Repeat everything for transpose(N):

Height/Function    C   H   U
-                  6   1   0
+                  3   2   0
++                 1   6   8
                  10   9   8
Function/Height    -     +     ++  |   CA1    CA2
C                 6/10  3/10  1/10 |  1.04   -.10
H                 1/9   2/9   6/9  | -0.32    .43
U                 0/8   0/8   8/8  | -0.94   -.37

Each column becomes a point in the probability space over the categories of the first variable
(the conditional distributions given the category value of the second variable).
# points = # categories of the second variable = l
dimension of the space = # categories of the first variable = m
CA: the new variables are linear combinations of the initial ones, preserving the dependence best.
Close CA1 values (for points well represented in the subspace) mean similar categories; this makes it easy to visualize and identify similar categories of the second variable (Function) in the low-dimensional plot (the U and H functions have similar Height profiles).
Overlap the two plots:

Function/Height    -     +     ++  |   CA1    CA2
C                 6/10  3/10  1/10 |  1.04   -.10
H                 1/9   2/9   6/9  | -0.32    .43
U                 0/8   0/8   8/8  | -0.94   -.37

Height:     -      +     ++
CA1       1.10    .85   -.84
CA2       -.92   1.93    .02

The CA values in one plot are (up to a scale) weighted means of the CA values in the other plot, with the weights given by the conditional probabilities (the constant is 1/√λ for the axis):
1.04 = (6/10 * 1.10 + 3/10 * .85 + 1/10 * (-.84)) * constant

Include "standard coordinates" = virtual rows concentrated on one column: (1 0 0), (0 1 0), (0 0 1).
Categories of different variables that lie close to the extremes of the axes and close to each other are highly correlated:
Utility dogs are big; Company dogs are small.
(see also gene shaving classification)
If we reorder the rows and the columns by the first CA coordinate, cells with high values generally move onto the diagonal:

Height/Function    C     H     U  |  CA1
-                  6     1     0  |  1.10
+                  3     2     0  |   .85
++                 1     6     8  |  -.84
CA1              1.04  -.32  -.94
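A short numpy sketch of this reordering, using the CA1 values from the table:

import numpy as np

N = np.array([[6, 1, 0],
              [3, 2, 0],
              [1, 6, 8]])
row_ca1 = np.array([1.10, 0.85, -0.84])    # CA1 of -, +, ++
col_ca1 = np.array([1.04, -0.32, -0.94])   # CA1 of C, H, U
# Sort rows and columns by decreasing first CA coordinate.
order_r, order_c = np.argsort(-row_ca1), np.argsort(-col_ca1)
print(N[order_r][:, order_c])   # high counts concentrate on the diagonal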
Extension
• Treat X itself as a crosstable N for two variables with 27 and 6 categories respectively:

Name          Height   Function
              - + ++   C H U
1. Boxer      0 1 0    1 0 0
…
27. Caniche   1 0 0    1 0 0
= X

Dividing each row by its total (2):

              -    +    ++    C    H    U  |  CA1  CA2
1. Boxer      0    ½    0     ½    0    0  |  .45  .88
…
27. Caniche   ½    0    0     ½    0    0  |  .91  .02
CA1          1.2  .85  -.84  1.04 -.32  -.4

The plot from transpose(X) is identical to the overlapped plots above.
The new plot from X adds extra points, one for each dog breed.
Relationship Height/Function vs. dog breed: the Caniche is a small company dog.
Multiple correspondence analysis
Use the whole X (all the variables) as the crosstable:

Name          Height   Weight   Speed    Intelligence  Affection  Aggressivity  Function
              - + ++   - + ++   - + ++   - + ++        - +        - +           C H U
1. Boxer      0 1 0    0 1 0    0 1 0    0 1 0         0 1        0 1           1 0 0
…
27. Caniche   1 0 0    1 0 0    0 1 0    0 0 1         0 1        1 0           1 0 0
= X

Category coordinates:
CA1:   .32   .60  -.89  -.35   .37  -.34  -.84   .78  …   .40  -.43
CA2: -1.04   .89   .37  -.81   .29   .46  -.29   .27  …   .19  -.21

Discovering association rules (based on correlation):
– Company dogs are small, with high affection
– Utility dogs are big, fast, aggressive
– Hunt dogs are very intelligent
Use them for classification.
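Multiple CA runs the same machinery on the whole indicator matrix; a minimal sketch, with a small illustrative 0/1 matrix standing in for the 27 × 19 table above:

import numpy as np

def category_coords(X, dims=2):
    # CA of an indicator matrix: SVD of the standardized residuals.
    P = X / X.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S)
    # Principal coordinates of the categories (columns).
    return Vt.T[:, :dims] * sv[:dims] / np.sqrt(c)[:, None]

# 4 individuals, 2 variables with 2 and 3 categories (illustrative).
X = np.array([[1, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [0, 1, 0, 1, 0]], dtype=float)
print(category_coords(X))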
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases,
relational databases, and other information repositories.
– Frequent pattern: pattern (set of items, sequence, etc.) that occurs
frequently in a database [AIS93]
• Motivation: finding regularities in data
– What products were often purchased together? — Beer and
diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
Basic Concepts: Frequent Patterns and
Association Rules
Transaction-id Items bought
10 A, B, C
20 A, C
30 A, D
40 B, E, F
(Venn diagram: customers buying beer, customers buying diapers, and customers buying both)

• Itemset X = {x1, …, xk}
• Find all the rules X => Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction having X also contains Y
• Let min_support = 50%, min_conf = 50%:
A => C (50%, 66.7%)
C => A (50%, 100%)
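A minimal sketch computing these two measures for the transaction table above:

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # Conditional probability that a transaction with X also contains Y.
    return support(X | Y) / support(X)

print(support({"A", "C"}), confidence({"A"}, {"C"}))   # 0.5, 0.667 for A => C
print(support({"A", "C"}), confidence({"C"}, {"A"}))   # 0.5, 1.0   for C => A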
Algorithms for scalable mining of (single-dimensional Boolean)
association rules in transactional databases
• Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested!
• Method:
– generate length (k+1) candidate itemsets from length-k frequent itemsets,
– test the candidates against the DB
• Challenges:
– multiple scans of the transaction database
– huge number of candidates
– tedious workload of support counting for the candidates
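A minimal, unoptimized Apriori sketch following this generate-and-test loop:

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)

    def support(s):
        return sum(s <= t for t in transactions) / n

    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent, k = list(Lk), 1
    while Lk:
        # Generate length (k+1) candidates from length-k frequent itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Test the surviving candidates against the database.
        Lk = {c for c in candidates if support(c) >= min_support}
        frequent += list(Lk)
        k += 1
    return frequent

tx = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(tx, 0.5))   # {A}, {B}, {C}, {A, C}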
Construct FP-tree From A Transaction Database
– For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty or contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
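A compact sketch of this recursion; to stay short it works on conditional pattern bases as plain lists rather than materializing the compressed tree, but it follows the same divide-and-conquer logic:

from collections import defaultdict

def fpgrowth(transactions, min_count):
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_count}
    patterns = {}

    def mine(base, suffix):
        # `base` is a conditional pattern base: (item list, count) pairs.
        local = defaultdict(int)
        for items, cnt in base:
            for i in items:
                local[i] += cnt
        for item, cnt in local.items():
            if cnt < min_count:
                continue
            new_suffix = suffix + (item,)
            patterns[frozenset(new_suffix)] = cnt
            # Conditional pattern base of `item`: prefixes of its paths.
            cond = [(items[:items.index(item)], c)
                    for items, c in base
                    if item in items and items.index(item) > 0]
            if cond:
                mine(cond, new_suffix)

    # Order each transaction by global frequency (the FP-tree item order).
    base = [(sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)), 1)
            for t in transactions]
    mine([b for b in base if b[0]], ())
    return patterns

tx = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(fpgrowth(tx, min_count=2))   # {A}: 3, {B}: 2, {C}: 2, {A, C}: 2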
Association-Based Classification
• Several methods for association-based
classification
– ARCS: Quantitative association mining and clustering
of association rules (Lent et al’97)
• It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification: (Liu et al’98)
• It mines high support and high confidence rules in the form of
“cond_set => y”, where y is a class label
– CAEP (Classification by aggregating emerging
patterns) (Dong et al’99)
• Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
• Mine EPs based on minimum support and growth rate
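A minimal sketch of the growth rate behind EPs (the two toy classes below are illustrative):

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

class1 = [{"a", "b"}, {"a"}, {"b", "c"}]            # illustrative class 1
class2 = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}]  # illustrative class 2

def growth_rate(itemset):
    # Ratio of supports across classes; large values flag an EP.
    s1 = support(itemset, class1)
    s2 = support(itemset, class2)
    return float("inf") if s1 == 0 else s2 / s1

print(growth_rate({"a", "b"}))   # support rises from 1/3 to 3/3 -> 3.0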
Table 1. Cell-cycle data as used in analysis
• The raw intensity data as obtained from the AIS imaging software (Imaging Research, St. Catherines, ON, Canada) were normalized
• The normalized data matrix was filtered for genes with positive min-max separation for at least one of the conditions under study (2)
• The data were shifted to a positive range by adding the minimum + 1
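A sketch of that last shift in numpy, reading "adding the minimum + 1" as shifting so the smallest entry becomes 1 (the matrix below is illustrative):

import numpy as np

data = np.array([[-0.4, 2.1],
                 [0.7, -1.3]])        # illustrative normalized values
shifted = data - data.min() + 1       # smallest entry becomes exactly 1
assert shifted.min() == 1.0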
          alpha0   alpha7   alpha14  alpha21  alpha28  alpha35  …
          (M/G1)   (M/G1)   (G1)     (G1)     (S)      (S)      …
YHR126C   5.81     5.73     6.01     5.48     5.37     5.23     …
YOR066W   5.62     5.81     6.02     5.28     5.02     5.23     …
hxt4      5.78     6.21     6.02     5.5      5.58     5.21     …
PCL9      4.64     5.39     4.89     5.19     4.96     5.62     …
mcm3      5.38     5.8      6.13     5.74     4.52     5.22     …
…
* 800 genes × 73 hybridizations
* 4 cell-cycle arrest methods (18 alpha, 24 cdc15, 17 cdc28, 14 elu hybridizations)
* Samples from each method were drawn and their cell-cycle phase classified into 5 classes
* A link to a database with information (meaning, functionality, etc.) for each gene is provided
Each cell-cycle phase is colored differently: (M/G1), (G1), (S), (G2), (M)
– One can see that the hybridizations separate according to their cell-cycle phase
(one phase = one region of the plot)
– The G1 phase is strongly associated with the histone gene cluster
– The cdc15-30 hybridization, classified yellow, behaves green (it is located in the green region); likewise cdc15-70 and cdc15-80,
suggesting improper phase classification for these samples
(a check against the expression profiles proves this correct)