Limma homework
Is it possible that some of these gene expression changes are miscalled
(i.e. biologically significant but with an insignificant p-value, or vice versa), and why?
What other criteria might you use to distinguish genes you care about?
How many genes pass the cutoff of q<0.01 and how does this compare to the number
of genes that pass the Bonferroni corrected p-value?
1920 genes have q < 0.01, but only 20 genes have p < 1.87 × 10^-6 (0.01 / 5338 t-tests)
What does using a q-value cutoff of 0.01 correspond to?
About 1% of selected genes (i.e. 19 genes out of 1920 with q<0.01) could be false
positives.
Sensitivity:
66 known Hsf1 targets (* but we only had data for 62)
40 of them had q<0.01.
Sensitivity: 40/62 = 64.5%
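The arithmetic in these answers can be checked with a short sketch. The numbers are taken from the homework answers above; the variable names are illustrative only.

```python
# Numbers quoted in the homework answers above.
n_tests = 5338          # number of t-tests performed
alpha = 0.01            # significance / q-value cutoff
n_q_selected = 1920     # genes with q < 0.01
targets_with_data = 62  # known Hsf1 targets with data
targets_called = 40     # known targets with q < 0.01

# Bonferroni correction: divide the cutoff by the number of tests.
bonferroni_cutoff = alpha / n_tests

# A q-value cutoff of 0.01 means about 1% of the selected genes
# are expected to be false positives.
expected_false_positives = alpha * n_q_selected

# Sensitivity: fraction of known targets that were recovered.
sensitivity = targets_called / targets_with_data

print(f"Bonferroni cutoff: {bonferroni_cutoff:.3g}")
print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Sensitivity: {sensitivity:.1%}")
```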
LAST TIME:
Gene X: (X1, X2, X3)
[Figure: Gene X plotted as a point with x, y and z coordinates]
‘centroid’ (average vector)
4. Centroid linkage clustering
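A minimal sketch of centroid linkage, assuming Euclidean distance between centroids (the slides do not fix the distance measure here, and a Pearson-based distance could be swapped in). The function names and the toy data are hypothetical.

```python
def centroid(vectors):
    """Average vector of a group of equal-length expression vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance (an assumption; not specified on the slide)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def centroid_linkage(points):
    """Repeatedly merge the two clusters whose centroids are closest.

    Returns the merges in order as (members_a, members_b, distance).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid([points[i] for i in clusters[a]])
                cb = centroid([points[i] for i in clusters[b]])
                d = euclidean(ca, cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Toy data: four genes measured in three experiments.
genes = [[1, 2, 3], [1, 2, 4], [8, 8, 8], [9, 8, 8]]
merges = centroid_linkage(genes)
for members_a, members_b, dist in merges:
    print(members_a, "+", members_b, f"at distance {dist:.2f}")
```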
Sometimes, want to use the weighted Pearson correlation

$$S_{x,y} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} Y_j^2}} \right)$$

Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5
For example: if these arrays are identical, the data are over-represented 3X
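A minimal sketch of the unweighted similarity score, assuming the normalization is the root mean square of each gene's values, as in the standard uncentered form of this score. The function name is hypothetical.

```python
from math import sqrt


def pearson_uncentered(x, y):
    """Similarity score S_{x,y} for two equal-length expression vectors.

    Each value is divided by the root mean square of its own vector,
    and the products are averaged over the N experiments.
    """
    n = len(x)
    phi_x = sqrt(sum(v * v for v in x) / n)
    phi_y = sqrt(sum(v * v for v in y) / n)
    return sum((a / phi_x) * (b / phi_y) for a, b in zip(x, y)) / n


gene_x = [1.0, 2.0, 3.0]
print(pearson_uncentered(gene_x, [2.0, 4.0, 6.0]))     # same shape
print(pearson_uncentered(gene_x, [-1.0, -2.0, -3.0]))  # opposite shape
```

Vectors with the same shape score close to 1 and vectors with opposite shape score close to -1, regardless of magnitude.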
Sometimes, want to use the weighted Pearson correlation

$$S_{x,y} = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \left( \frac{X_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} Y_j^2}} \right)$$

Where $w_i = 1 / L_i$
k = array corr. cutoff
d = Pearson distance (= 1 - Pearson corr.)
n = exponent (usually 1)
Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5
For example: if these arrays are identical, the data are over-represented 3X
-- can weight experiments i = 3, 4, 5 by w = 0.33
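A sketch of why down-weighting helps, under one stated assumption: here the weights are applied to the normalization terms as well as to the sum of products, which is a common form of weighted correlation and may differ in detail from the slide's formula. With three identical arrays each weighted 1/3, the score matches what you would get from counting that array once. The function name and data are hypothetical.

```python
from math import sqrt


def weighted_correlation(x, y, w):
    """Weighted (uncentered) correlation of two expression vectors.

    Assumption: the weights w enter both the numerator and the
    normalization terms.
    """
    sxy = sum(wi * a * b for wi, a, b in zip(w, x, y))
    sxx = sum(wi * a * a for wi, a in zip(w, x))
    syy = sum(wi * b * b for wi, b in zip(w, y))
    return sxy / sqrt(sxx * syy)


# Experiments 3, 4 and 5 are identical arrays (over-represented 3X).
gene_x = [1.0, -2.0, 0.5, 0.5, 0.5]
gene_y = [0.8, -1.0, -1.5, -1.5, -1.5]

unweighted = weighted_correlation(gene_x, gene_y, [1.0] * 5)
weighted = weighted_correlation(gene_x, gene_y, [1.0, 1.0, 1 / 3, 1 / 3, 1 / 3])

# Same data with the repeated array counted only once.
collapsed = weighted_correlation(gene_x[:3], gene_y[:3], [1.0] * 3)

print(unweighted, weighted, collapsed)
```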
[Figure: unweighted Pearson correlation vs. weighted Pearson correlation]
Can also cluster array experiments based on global similarity in expression
(Alizadeh et al. 2000)
Hierarchical trees of gene expression data are analogous to phylogenetic trees
[Figure: tree of genes A to F, drawn with two different node orientations]
Distance between genes is proportional to the total branch length
between genes (not the distance on the y-axis)
Orientation of the nodes is irrelevant, although some clustering programs try to
organize nodes in some way.
Genes involved in the same cellular process are often coregulated
These genes may not have the same annotation, but they still function together and are thus co-expressed
M choose i = number of possible groups of size i composed of the M objects

$$\binom{M}{i} = \frac{M!}{(M-i)! \, i!}$$
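The formula above can be checked directly against Python's built-in `math.comb`. The helper name `n_choose` is illustrative only.

```python
from math import comb, factorial


def n_choose(m, i):
    """Number of possible groups of size i drawn from m objects."""
    return factorial(m) // (factorial(m - i) * factorial(i))


# 5 objects, groups of size 2: 5! / (3! * 2!) = 10
print(n_choose(5, 2), comb(5, 2))
```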
Advantages and Disadvantages of Hierarchical clustering
Advantages:
1) Straightforward
2) Captures biological information relatively well
Disadvantages:
1) Doesn’t give discrete clusters … need to define clusters with cutoffs
2) Hierarchical arrangement does not always represent the data appropriately
-- sometimes a hierarchy is not appropriate, and each gene can belong to only one cluster.
3) Get a different clustering for different experiment sets
THERE IS NO ONE PERFECT CLUSTERING METHOD
k-means clustering
Partitioning (or top-down) clustering method
-- Randomly split the data into k groups with equal numbers of genes
-- Calculate the centroid of each group
-- Reassign each gene to the centroid to which it is most similar
-- Calculate a new centroid for each group, reassign genes, etc … iterate until stable
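The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance as the similarity measure (the slide does not specify one), and the function names, the `seed` parameter and the toy data are hypothetical.

```python
import random


def centroid(vectors):
    """Average vector of a group of equal-length expression vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumption, see lead-in)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(genes, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly split the genes into k groups of (near) equal size.
    order = list(range(len(genes)))
    rng.shuffle(order)
    labels = [0] * len(genes)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: calculate the centroid of each group
        # (an emptied group keeps its previous centroid).
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
        # Step 3: reassign each gene to its most similar centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(g, centroids[c]))
            for g in genes
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Toy data: two well-separated groups of three genes each.
genes = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(genes, k=2)
print(labels)
```

Changing `seed` changes the initial random split, so on less clearly separated data different runs can end in different clusterings.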
What are the disadvantages of k-means clustering?
- Need to know how many clusters to ask for
(can define this empirically)
- Genes are not organized within each cluster
(can hierarchically cluster genes afterwards or use SOM analysis)
- The random starting split makes this a non-deterministic method: different runs can give different clusters