Clustering_PartII_2012

baySeq homework
HS analysis:
Out of 7388 genes with data, 1995 genes were DE at FDR <1%,
3158 genes were DE at FDR <5%
There were 3,582 genes with an average fold-change >2X (1.0 in log2 space)
2,669 (63%)
BUT
HS + EtOH analysis (added 2 replicates of a new condition):
Only 1618 genes were DE (under any of the models) at an FDR of 5%
??? Why so few, when 3158 met this cutoff when HS was analyzed alone?
baySeq paper: harder to call DE with “more complex” models
How well did baySeq do on the HS only analysis?
3158 genes FDR <0.05 (10K iterations on prior calculation)
902 genes FDR >5% but fold-change >1.5X in both replicates
~50% of these: low counts
Many of the remaining were missed due to day-to-day variation that is not accounted for without pairing the data
How well did baySeq do on the HS + EtOH analysis?
Models:
NDE = 1,1,1,1,1,1
DEH = 1,1,2,2,1,1
DEE = 1,1,1,1,2,2
DEHE = 1,1,2,2,2,2
DEHE2 = 1,1,2,2,3,3
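Each model above assigns the six samples to groups: samples that share a number are treated as having the same expression level. The sketch below only illustrates that reading in plain Python; it is not baySeq code, and the sample names and their column order (two control, two HS, two EtOH replicates) are assumptions inferred from the model names.

```python
# Group labels per model, copied from the list above.
MODELS = {
    "NDE":   [1, 1, 1, 1, 1, 1],
    "DEH":   [1, 1, 2, 2, 1, 1],
    "DEE":   [1, 1, 1, 1, 2, 2],
    "DEHE":  [1, 1, 2, 2, 2, 2],
    "DEHE2": [1, 1, 2, 2, 3, 3],
}

# Hypothetical sample names; the real column order is an assumption.
SAMPLES = ["ctrl_1", "ctrl_2", "HS_1", "HS_2", "EtOH_1", "EtOH_2"]


def groups_for(model):
    """Return the lists of samples that a model treats as equivalent."""
    grouped = {}
    for sample, label in zip(SAMPLES, MODELS[model]):
        grouped.setdefault(label, []).append(sample)
    return list(grouped.values())


for name in MODELS:
    print(name, groups_for(name))
```

Under this reading, DEHE says HS and EtOH differ from control in the same way, while DEHE2 lets them differ from control and from each other.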
1618 genes FDR <0.05 for at least one DE model
But 1391 genes had FDR > 0.05 for all DE models yet at least a 1.5X expression change in all 4 samples
Why weren’t these identified as DE?
218 of these genes were DE when HS was analyzed ALONE.
Assessing sensitivity (with VLOOKUP in Excel)
There were 64 known Hsf1 targets *with data* in the file.
My run identified 38 of those at an FDR of 0.01
38/64 = 59.4% sensitivity
45 were identified at an FDR of 0.05
45/64 = 70.3% sensitivity
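The VLOOKUP step amounts to asking which known targets appear in the list of genes called DE. A minimal Python sketch of the same calculation follows; the function name and the gene names in the toy call are hypothetical, and only the counts 38, 45, and 64 come from the slide.

```python
def sensitivity(known_targets, called_genes):
    """Fraction of known targets (with data) that were called DE."""
    known = set(known_targets)
    found = known & set(called_genes)
    return len(found) / len(known)


# Toy example with hypothetical gene names: 2 of 4 known targets are called.
print(sensitivity(["g1", "g2", "g3", "g4"], ["g1", "g3", "g9"]))  # 0.5

# Counts reported above: 38 and 45 of the 64 known targets.
print(f"{38 / 64:.1%}")  # 59.4%
print(f"{45 / 64:.1%}")  # 70.3%
```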
LAST TIME:
Gene X: X1 = x coordinate, X2 = y coordinate, X3 = z coordinate
‘centroid’ (average vector)
4. Centroid linkage clustering
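As a reminder of what the 'average vector' means, here is a small sketch that averages each coordinate across the genes in a cluster. The three expression vectors are made-up values used only for illustration.

```python
def centroid(vectors):
    """Average vector: the mean of each coordinate across genes."""
    n = len(vectors)
    return [sum(coords) / n for coords in zip(*vectors)]


# Hypothetical expression vectors (X1, X2, X3) for three genes.
cluster = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
print(centroid(cluster))  # [2.0, 2.0, 2.0]
```

In centroid linkage, the distance between two clusters is then taken as the distance between their centroids.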
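Sometimes, want to use the weighted Pearson correlation
Standard (unweighted) form:
$$S_{x,y} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} Y_j^2}} \right)$$
Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5
For example: if arrays 3, 4, and 5 are identical, the data are over-represented 3X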
Weighted form:
$$S_{x,y} = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \left( \frac{X_i}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} Y_j^2}} \right)$$
Where $w_i = 1 / L_i$, with $L_i = \sum_{j:\, d_{ij} \le k} \left( \frac{k - d_{ij}}{k} \right)^n$
k = array corr. cutoff
d = Pearson distance (= 1 - P. corr)
n = exponent (usually 1)
Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5
For example: if arrays 3, 4, and 5 are identical, the data are over-represented 3X
-- can weight experiments i = 3,4,5 by w = 0.33
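The weighting idea can be tried out with a short sketch. It follows the formula as it appears on the slide, so no mean is subtracted and the normalising terms are left unweighted; both points are assumptions about the intended form rather than a statement of how any particular clustering program computes it. The function name and the data values are hypothetical.

```python
import math


def weighted_pearson(x, y, w=None):
    """Pearson-style similarity with optional per-array weights."""
    n = len(x)
    if w is None:
        w = [1.0] * n  # equal weights reduce to the unweighted form
    norm_x = math.sqrt(sum(xi * xi for xi in x) / n)
    norm_y = math.sqrt(sum(yi * yi for yi in y) / n)
    total = sum(
        wi * (xi / norm_x) * (yi / norm_y) for wi, xi, yi in zip(w, x, y)
    )
    return total / sum(w)


# Hypothetical data: arrays 3-5 are identical copies of one experiment.
gene_x = [1.0, -2.0, 3.0, 3.0, 3.0]
gene_y = [2.0, -1.0, 2.5, 2.5, 2.5]
weights = [1.0, 1.0, 1 / 3, 1 / 3, 1 / 3]

print(weighted_pearson(gene_x, gene_y))           # triplicate counted 3X
print(weighted_pearson(gene_x, gene_y, weights))  # triplicate down-weighted
```

Setting every weight to the same value gives back the unweighted score, which is a quick sanity check on the weighting step.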
[Figure: clustering with unweighted vs. weighted Pearson correlation]
Can also cluster array experiments based on global similarity in expression (Alizadeh et al. 2000)
Hierarchical trees of gene expression data are analogous to phylogenetic trees
[Figure: tree with leaves A, D, B, E, F, C]
Distance between genes is proportional to the total branch length between genes (not the distance on the y-axis)
Orientation of the nodes is irrelevant … although some clustering programs try to organize nodes in some way.
[Figure: the same tree with nodes flipped; leaf order D, B, A, E, F, C]
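To see why node orientation carries no information, the sketch below enumerates every left-to-right leaf order that can be produced by flipping nodes of one tree. The topology used here is hypothetical (the actual tree is only in the figure); the point is that one tree corresponds to many equally valid drawings.

```python
from itertools import product

# Hypothetical topology for leaves A-F; the real tree is in the figure.
TREE = ((("A", "D"), ("B", "E")), ("F", "C"))


def leaf_orders(node):
    """Yield every left-to-right leaf order reachable by flipping nodes."""
    if isinstance(node, str):
        yield [node]
        return
    left, right = node
    for a, b in product(list(leaf_orders(left)), list(leaf_orders(right))):
        yield a + b  # node as drawn
        yield b + a  # same node flipped


orders = list(leaf_orders(TREE))
print(len(orders))  # 32 = 2 ** 5, one factor of 2 per internal node
```

A binary tree with n leaves has n - 1 internal nodes, so there are 2^(n-1) orderings of the same tree, which is why genes that end up adjacent in the display are not necessarily the most similar.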
Genes involved in the same cellular process are often coregulated
These genes may not have the same annotation, but still function together and are thus co-expressed
M choose i = number of possible groups of size i drawn from M objects
$$\binom{M}{i} = \frac{M!}{(M-i)!\; i!}$$
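A quick numerical check of the formula, written out with factorials and compared against Python's built-in `math.comb`. The function name `n_groups` and the example values (6 objects, groups of 2) are chosen only for illustration.

```python
import math


def n_groups(m, i):
    """Number of distinct groups of size i drawn from m objects."""
    return math.factorial(m) // (math.factorial(m - i) * math.factorial(i))


print(n_groups(6, 2))                      # 15
print(n_groups(6, 2) == math.comb(6, 2))   # True
```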
Advantages and Disadvantages of Hierarchical clustering
Advantages:
1) Straightforward
2) Captures biological information relatively well
Disadvantages:
1) Doesn’t give discrete clusters … need to define clusters with cutoffs
2) Hierarchical arrangement does not always represent data appropriately
-- sometimes a hierarchy is not appropriate: genes can belong to only one cluster.
3) Get different clusterings for different experiment sets
THERE IS NO ONE PERFECT CLUSTERING METHOD
k-means clustering
Partitioning (or top-down) clustering method
-- Randomly split the data into k groups with equal numbers of genes
-- Calculate the centroid of each group
-- Reassign each gene to the centroid to which it is most similar
-- Calculate a new centroid for each group, reassign genes, etc … iterate until stable
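The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular clustering program: squared Euclidean distance stands in for "most similar" (an assumption; a correlation-based measure could be swapped in), and the function names, the `seed` and `max_iter` parameters, and the toy data are all hypothetical.

```python
import random


def centroid(vectors):
    """Average vector of a group of genes."""
    return [sum(c) / len(vectors) for c in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed similarity measure)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(genes, k, seed=0, max_iter=100):
    """Cluster expression vectors; returns one cluster label per gene."""
    rng = random.Random(seed)
    # Randomly split the genes into k groups of (nearly) equal size.
    order = list(range(len(genes)))
    rng.shuffle(order)
    labels = [0] * len(genes)
    for position, gene_index in enumerate(order):
        labels[gene_index] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Calculate the centroid of each group (keep the old one if empty).
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
        # Reassign each gene to the most similar centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(g, centroids[c]))
            for g in genes
        ]
        if new_labels == labels:  # stable: no gene changed group
            break
        labels = new_labels
    return labels


# Toy data: two well-separated groups of hypothetical expression vectors.
genes = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(k_means(genes, k=2))
```

Changing `seed` changes the starting split, and on less clearly separated data it can change the final clusters as well.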
What are the disadvantages of k-means clustering?
- Need to know how many clusters to ask for
(can define this empirically)
- Genes are not organized within each cluster
(can hierarchically cluster genes afterwards or use SOM analysis)
- Random starting split makes this a non-deterministic method