Clustering_PartII_2012
baySeq homework

HS analysis:
- Out of 7388 genes with data, 1995 genes were DE at FDR < 1% and 3158 genes were DE at FDR < 5%
- There were 3,582 genes with an average fold-change > 2X (1.0 in log2 space)
- 2,669 (63%)

BUT HS + EtOH analysis (added 2 replicates of a new condition):
- Only 1618 genes were DE (under any of the models) at an FDR of 5%
- Why so few, when 3158 met this cutoff when HS was analyzed alone?
- baySeq paper: harder to call DE with “more complex” models

How well did baySeq do on the HS only analysis?
- 3158 genes at FDR < 0.05 (10K iterations on the prior calculation)
- 902 genes had FDR > 5% but fold-change > 1.5X in both replicates
- ~50% of these: low counts
- Many of the remaining were missed due to day-to-day variation that is not accounted for without pairing the data

How well did baySeq do on the HS + EtOH analysis?
Models:
NDE = 1,1,1,1,1,1
DEH = 1,1,2,2,1,1
DEE = 1,1,1,1,2,2
DEHE = 1,1,2,2,2,2
DEHE2 = 1,1,2,2,3,3
- 1618 genes at FDR < 0.05 for at least one DE model
- But 1391 genes had FDR > 0.05 for all DE models yet at least a 1.5X expression change in all 4 samples
- Why weren’t these identified as DE?
- 218 of these genes were DE when HS was analyzed ALONE.

Assessing sensitivity (with VLOOKUP in Excel)
- There were 64 known Hsf1 targets *with data* in the file.
- My run identified 38 of those at an FDR of 0.01: 38/64 = 59.4% sensitivity
- 45 were identified at an FDR of 0.05: 45/64 = 70% sensitivity

LAST TIME:
[Figure: Gene X plotted as a point with coordinates X1, X2, X3 on the x, y and z axes]

LAST TIME: 4. Centroid linkage clustering
- ‘centroid’ = average vector

Sometimes, want to use the weighted Pearson correlation

Without weights:

$$S_{x,y} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} Y_j^2}} \right)$$

Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5
For example: if arrays 3, 4 and 5 are identical, the data are over-represented 3X

With weights:

$$S_{x,y} = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \left( \frac{X_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} Y_j^2}} \right)$$

where $w_i = 1 / L_i$, and $L_i$ depends on:
- k = array correlation cutoff
- d = Pearson distance (= 1 - Pearson correlation)
- n = exponent (usually 1)

For example: if arrays 3, 4 and 5 are identical, the data are over-represented 3X -- can weight experiments i = 3,4,5 by w = 0.33

[Figure: clustering with the unweighted Pearson correlation vs. the weighted Pearson correlation]

Can also cluster array experiments based on global similarity in expression (Alizadeh et al. 2000)

Hierarchical trees of gene expression data are analogous to phylogenetic trees
[Figure: hierarchical tree with leaves A, B, C, D, E, F]
- Distance between genes is proportional to the total branch length between genes (not the distance on the y-axis)
- Orientation of the nodes is irrelevant … although some clustering programs try to organize nodes in some way.

Genes involved in the same cellular process are often coregulated
- These genes may not have the same annotation, but still function together and are thus co-expressed

M choose i = # of possible groups of size i that can be formed from M objects

$$\binom{M}{i} = \frac{M!}{(M-i)! \, i!}$$
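The weighted Pearson correlation from the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `weighted_pearson` and the example expression values are hypothetical, the weights are passed in directly rather than derived from L_i, and, as in the formula as written, no mean is subtracted from the values and the normalizing terms are unweighted.

```python
import math


def weighted_pearson(x, y, weights=None):
    """Similarity score S(x, y) between two genes measured on the same arrays.

    x, y    -- expression values of gene X and gene Y, one value per array
    weights -- one weight per array; None means every array has weight 1,
               which gives the unweighted score
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of arrays")
    n = len(x)
    if weights is None:
        weights = [1.0] * n
    if len(weights) != n:
        raise ValueError("need exactly one weight per array")

    # Normalizing terms: sqrt((1/N) * sum of squared values), unweighted
    norm_x = math.sqrt(sum(v * v for v in x) / n)
    norm_y = math.sqrt(sum(v * v for v in y) / n)
    if norm_x == 0 or norm_y == 0:
        raise ValueError("score is undefined for an all-zero vector")

    # Weighted sum of the products of the normalized values
    total = sum(w * (a / norm_x) * (b / norm_y)
                for w, a, b in zip(weights, x, y))
    return total / sum(weights)


# Hypothetical values: arrays 3, 4 and 5 are identical replicates,
# so each of them is down-weighted to 0.33.
gene_x = [1.2, -0.4, 2.0, 2.0, 2.0]
gene_y = [0.9, 0.3, 1.8, 1.8, 1.8]
unweighted_score = weighted_pearson(gene_x, gene_y)
weighted_score = weighted_pearson(gene_x, gene_y, [1, 1, 0.33, 0.33, 0.33])
```

Comparing `unweighted_score` with `weighted_score` shows how much the three identical arrays were driving the similarity between the two genes.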
Advantages and Disadvantages of Hierarchical clustering

Advantages:
1) Straightforward
2) Captures biological information relatively well

Disadvantages:
1) Doesn’t give discrete clusters … need to define clusters with cutoffs
2) Hierarchical arrangement does not always represent data appropriately -- sometimes a hierarchy is not appropriate: genes can belong only to one cluster.
3) Get different clustering for different experiment sets

THERE IS NO ONE PERFECT CLUSTERING METHOD

k-means clustering
Partitioning (or top-down) clustering method
-- Randomly split the data into k groups with equal numbers of genes
-- Calculate the centroid of each group
-- Reassign each gene to the centroid to which it is most similar
-- Calculate a new centroid for each group, reassign genes, etc … iterate until stable
[Figure: centroids]

What are the disadvantages of k-means clustering?
- Need to know how many clusters to ask for (can define this empirically)
- Genes are not organized within each cluster (can hierarchically cluster genes afterwards or use SOM analysis)
- Random process makes this an indeterminate method
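
The k-means steps listed earlier can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `kmeans`, `centroid` and `sq_dist` are hypothetical, "most similar" is taken to mean smallest squared Euclidean distance (an assumption; a Pearson-based distance could be substituted), and re-seeding an empty group with a random gene is a choice made for this sketch.

```python
import random


def centroid(vectors):
    """Average vector of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance (assumed similarity measure)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def group_centroids(genes, labels, k, rng):
    """Centroid of each of the k groups; an empty group is re-seeded
    with a randomly chosen gene (a choice made for this sketch)."""
    result = []
    for c in range(k):
        members = [g for g, lab in zip(genes, labels) if lab == c]
        result.append(centroid(members) if members else list(rng.choice(genes)))
    return result


def kmeans(genes, k, seed=None, max_iter=100):
    """Return (labels, centroids) for a list of gene expression vectors."""
    if not 1 <= k <= len(genes):
        raise ValueError("k must be between 1 and the number of genes")
    rng = random.Random(seed)

    # Randomly split the data into k groups of (nearly) equal size
    order = list(range(len(genes)))
    rng.shuffle(order)
    labels = [0] * len(genes)
    for position, gene_index in enumerate(order):
        labels[gene_index] = position % k

    for _ in range(max_iter):
        # Calculate the centroid of each group
        centroids = group_centroids(genes, labels, k, rng)
        # Reassign each gene to the centroid it is most similar to
        new_labels = [min(range(k), key=lambda c: sq_dist(g, centroids[c]))
                      for g in genes]
        if new_labels == labels:  # assignments are stable: stop iterating
            break
        labels = new_labels

    # Centroids of the final assignment
    return labels, group_centroids(genes, labels, k, rng)


# Hypothetical data: six genes measured in two experiments
genes = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(genes, k=2, seed=1)
```

Because the initial split is random, different seeds can give different cluster numbering and, on less clearly separated data, different clusters, which is the indeterminacy noted in the list of disadvantages.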