Clustering2_11-8
Limma homework

Is it possible that some of these gene expression changes are miscalled (i.e. biologically significant but with an insignificant p-value, and vice versa), and why? What other criteria might you use to distinguish genes you care about?

How many genes pass the cutoff of q < 0.01, and how does this compare to the number of genes that pass the Bonferroni-corrected p-value?
1920 genes have q < 0.01 … only 20 genes have p < 1.87 x 10^-6 (0.01 / 5338 t-tests)

What does using a q-value cutoff of 0.01 correspond to?
About 1% of selected genes (i.e. 19 genes out of the 1920 with q < 0.01) could be false positives.

Sensitivity: 66 known Hsf1 targets (but we only had data for 62); 40 of them had q < 0.01.
Sensitivity: 40/62 = 64.5%

LAST TIME:
[Figure: Gene X = (X1, X2, X3) drawn as a point with x, y and z coordinates]

LAST TIME: 'centroid' (average vector)
4. Centroid linkage clustering

Sometimes, want to use the weighted Pearson correlation

Unweighted:

$$S_{x,y} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} Y_j^2}} \right)$$

Gene X: X1 X2 X3 X4 X5
Gene Y: Y1 Y2 Y3 Y4 Y5

For example: if these arrays (i = 3, 4, 5) are identical, the data are over-represented 3X

Weighted:

$$S_{x,y} = \frac{1}{\sum_{i=1}^{N} w_i} \sum_{i=1}^{N} w_i \left( \frac{X_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} X_j^2}} \right) \left( \frac{Y_i}{\sqrt{\frac{1}{N} \sum_{j=1}^{N} Y_j^2}} \right)$$

Where w_i = 1 / L_i
k = array correlation cutoff
d = Pearson distance (= 1 - Pearson correlation)
n = exponent (usually 1)

-- can weight experiments i = 3, 4, 5 by w = 0.33

[Figure: unweighted Pearson correlation vs. weighted Pearson correlation]

Can also cluster array experiments based on global similarity in expression
Alizadeh et al. 2000

Hierarchical trees of gene expression data are analogous to phylogenetic trees
[Figure: tree with leaves A, D, B, E, F, C]
Distance between genes is proportional to the total branch length between genes (not the distance on the y-axis)
Orientation of the nodes is irrelevant,
although some clustering programs try to organize nodes in some way.

Genes involved in the same cellular process are often coregulated
These genes may not have the same annotation, but still function together and are thus co-expressed

M choose i = # of possible groups of size i composed of the M objects

$$\binom{M}{i} = \frac{M!}{(M-i)! \, i!}$$

Advantages and Disadvantages of Hierarchical clustering

Advantages:
1) Straightforward
2) Captures biological information relatively well

Disadvantages:
1) Doesn't give discrete clusters … need to define clusters with cutoffs
2) Hierarchical arrangement does not always represent data appropriately -- sometimes a hierarchy is not appropriate: genes can belong only to one cluster.
3) Get different clustering for different experiment sets

THERE IS NO ONE PERFECT CLUSTERING METHOD

k-means clustering

Partitioning (or top-down) clustering method
-- Randomly split the data into k groups with equal numbers of genes
-- Calculate the centroid of each group
-- Reassign each gene to the centroid to which it is most similar
-- Calculate a new centroid for each group, reassign genes, etc … iterate until stable

[Figure: genes grouped around their centroids]

What are the disadvantages of k-means clustering?
- Need to know how many clusters to ask for (can define this empirically)
- Genes are not organized within each cluster (can hierarchically cluster genes afterwards or use SOM analysis)
- The random starting partition makes this a non-deterministic method: repeated runs can give different clusters
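
The k-means steps listed above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by any particular clustering program. The function name `kmeans_random_partition`, the use of Euclidean distance as the similarity measure, and the rule of keeping the old centroid when a group becomes empty are all assumptions made for this example.

```python
import numpy as np

def kmeans_random_partition(data, k, max_iter=100, seed=0):
    """Cluster the rows of `data` (genes x conditions) into k groups.

    Hypothetical helper for illustration; 'most similar' is taken to mean
    smallest Euclidean distance, which is an assumption.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n_genes = data.shape[0]
    # Step 1: randomly split the genes into k groups of (nearly) equal size.
    labels = rng.permutation(n_genes) % k
    centroids = np.zeros((k, data.shape[1]))
    for _ in range(max_iter):
        # Step 2: the centroid is the average expression vector of each group.
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # assumption: keep old centroid if group is empty
                centroids[c] = members.mean(axis=0)
        # Step 3: reassign each gene to its closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no gene changes group.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Usage on made-up data: 30 genes measured in 4 conditions, split into 3 groups.
toy = np.random.default_rng(1).normal(size=(30, 4))
toy_labels, toy_centroids = kmeans_random_partition(toy, k=3, seed=42)
```

Changing `seed` changes the starting partition, which is one way to see the non-determinism noted in the list of disadvantages.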
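
Returning to the weighted Pearson correlation from earlier in the lecture, the sketch below shows the effect of down-weighting replicated arrays. It uses the uncentered form (no mean subtraction). As an assumption of this example, the weights are also applied inside the normalization terms, so that a gene's similarity to itself stays 1. The function name `weighted_similarity` and the expression values are made up for illustration.

```python
import numpy as np

def weighted_similarity(x, y, w=None):
    """Weighted, uncentered correlation between two expression vectors.

    Hypothetical helper. Assumption: the weights enter both the numerator
    and the normalization, so weighted_similarity(x, x, w) == 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    numerator = np.sum(w * x * y)
    normalization = np.sqrt(np.sum(w * x * x) * np.sum(w * y * y))
    return numerator / normalization

# Made-up values: arrays 3, 4 and 5 are identical copies of one experiment.
gene_x = [1.0, -2.0, 0.5, 0.5, 0.5]
gene_y = [0.8, -1.5, -1.0, -1.0, -1.0]
weights = [1.0, 1.0, 1 / 3, 1 / 3, 1 / 3]

unweighted = weighted_similarity(gene_x, gene_y)            # replicate counted 3X
weighted = weighted_similarity(gene_x, gene_y, weights)     # replicate counted once
deduplicated = weighted_similarity(gene_x[:3], gene_y[:3])  # copies removed by hand
```

With weights of 1/3 on the three identical arrays, `weighted` matches `deduplicated`, which is the point of the weighting: the replicated experiment no longer dominates the similarity score.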
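
The numbers in the Limma homework answers can be reproduced with simple arithmetic. The sketch below only recomputes the figures quoted in the homework; the function names are hypothetical, and it does not compute p-values or q-values from data.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value threshold after Bonferroni correction."""
    return alpha / n_tests

def expected_false_positives(q_cutoff, n_selected):
    """Approximate number of false positives among genes passing a q-value cutoff."""
    return q_cutoff * n_selected

def sensitivity(n_detected, n_known):
    """Fraction of known targets that were detected."""
    return n_detected / n_known

print(f"{bonferroni_cutoff(0.01, 5338):.2e}")         # 1.87e-06
print(f"{expected_false_positives(0.01, 1920):.1f}")  # 19.2
print(f"{sensitivity(40, 62):.1%}")                   # 64.5%
```

The contrast between the two cutoffs is what the homework highlights: 1920 genes pass q < 0.01, while only 20 pass the much stricter Bonferroni threshold.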