Download Microarray Analysis 3

Microarray Data Analysis
- Data preprocessing and visualization
- Supervised learning
- Unsupervised learning
- Machine learning approaches
- Clustering and pattern detection
- Prediction of gene regulatory regions based on coregulated genes
- Linkage between gene expression data and gene sequence/function databases
- …

Unsupervised learning
- Supervised methods
  - Can only validate or reject hypotheses
  - Cannot lead to discovery of unexpected partitions
- Unsupervised learning
  - No prior knowledge is used
  - Explore structure of data on the basis of correlations and similarities

DEFINITION OF THE CLUSTERING PROBLEM (Eytan Domany)

CLUSTER ANALYSIS YIELDS DENDROGRAM (Eytan Domany)
T (RESOLUTION)

BUT WHAT ABOUT THE OKAPI? (Eytan Domany)

Centroid methods: K-means (Eytan Domany)
- Data points at $X_i$, $i = 1, \dots, N$
- Centroids at $Y_\mu$, $\mu = 1, \dots, K$
- Assign data point $i$ to centroid $\mu$: $S_i = \mu$
- Cost E:
  \[ E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\mu=1}^{K} \delta(S_i, \mu)\,(X_i - Y_\mu)^2 \]
- Minimize E over $S_i$, $Y_\mu$

K-means (Eytan Domany)
- “Guess” K = 3
- Iteration 0: Start with random positions of centroids.
- Iteration 1: Assign each data point to the closest centroid.
- Iteration 2: Move centroids to the center of the assigned points.
- Iteration 3: Iterate until the cost is minimal.

K-means: Summary
- Fast algorithm: compute distances from data points to centroids
- Result depends on the initial positions of the centroids
- Must preset K
- Fails for “non-spherical” distributions

Agglomerative Hierarchical Clustering
At each step, merge the pair of nearest clusters. This requires defining the distance between the new cluster and the other clusters.
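The merge rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code. It assumes Manhattan distance and average linkage (the mean of all pairwise distances between two clusters), the combination the exercise slide asks for. The six points come from our reading of that exercise's flattened table (Gene ID, Exp1, Exp2), so the pairing of values is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(manhattan(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Naive agglomerative clustering; returns the list of merges.

    Each merge is (cluster_a, cluster_b, distance), where clusters are
    tuples of point IDs.
    """
    clusters = [(i,) for i in points]  # initially, each point is a cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: find the pair of nearest clusters and merge it.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        d = average_linkage(a, b, points)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Assumed reading of the exercise table: gene ID -> (Exp1, Exp2).
points = {
    1: (45, 55), 2: (55, 78), 3: (148, 1303),
    4: (241, 765), 5: (774, 439), 6: (607, 383),
}

merges = agglomerate(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Under these assumptions the first merge joins genes 1 and 2 at Manhattan distance 33. The loop recomputes every pairwise linkage at each step, which is fine for six points but would be too slow for a real expression matrix.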
Initially, each point is its own cluster.
- Single linkage: distance between the closest pair
- Complete linkage: distance between the farthest pair
- Average linkage: average distance between all pairs, or distance between cluster centers
[Figure: points 1 to 5 and their dendrogram, leaves ordered 1 3 2 4 5; axis label: Distance between joined clusters]
The dendrogram induces a linear ordering of the data points.

Dendrogram (Eytan Domany)

Hierarchical Clustering Summary
- Results depend on the distance update method
- Greedy iterative process
- NOT robust against noise
- No inherent measure to identify stable clusters
- Average linkage: the most widely used clustering method in gene expression analysis
[Figure: Nature 2002, breast cancer]

Heat map
- Cluster both genes and samples
- Samples should cluster together based on experimental design
- Often a way to catch labelling errors or heterogeneity in samples

Epinephrine Treated Rat Fibroblast Cell

ID   Probe          1h       5h       10h      18h      24h
1    D21869_s_at    25.7     55.0     170.7    305.5    807.9
2    D25233_at      705.2    578.2    629.2    641.7    795.3
3    D25543_at      2148.7   1303.0   915.5    149.2    96.3
4    L03294_g_at    241.8    421.5    577.2    866.1    2107.3
5    J03960_at      774.5    439.8    314.3    256.1    44.4
6    M81855_at      1487.6   1283.7   1372.1   1469.1   1611.7
7    L14936_at      1212.6   1848.5   2436.2   3260.5   4650.9
8    L19998_at      767.9    290.8    300.2    129.4    51.5
9    AB017912_at    1813.7   3520.6   4404.3   6853.1   9039.4
10   M32855_at      234.1    23.1     789.4    312.7    67.8

Heat map
[Figure panels: Correlation coeff; Normalized across each gene]

Distance Issues
- Euclidean distance [Figure labels: g1, g3, g2, g4]
- Pearson distance [Figure: gene1 to gene4 plotted over time0 to time3; vertical axis 0 to 400]

Exercise
Use the average linkage algorithm and Manhattan distance.

Gene ID   Exp1   Exp2
1         45     55
2         55     78
3         148    1303
4         241    765
5         774    439
6         607    383

Issues in Cluster Analysis
- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters after all?
- Are the clusters stable?
- Are the clusters meaningful?

Which Clustering Method Should I Use?
- What is the biological question?
- Do I have a preconceived notion of how many clusters there should be?
- How strict do I want to be? Split or join?
- Can a gene be in multiple clusters?
- Hard or soft boundaries between clusters

The End
- Thank you for taking this course. Bioinformatics is a very diverse and fascinating subject. We hope you all decide to continue your pursuit of it.
- We will be very glad to answer your emails or schedule appointments to talk about any bioinformatics-related questions you might have.
- We wish you all a wonderful summer break!
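As a supplement to the K-means slides earlier in the deck, the four steps (start with centroids, assign each point to the closest centroid, move each centroid to the center of its points, iterate) can be sketched in plain Python. This is a hedged illustration rather than the lecturer's implementation. The function names, the `init` parameter, the choice to draw random starting centroids from the data points, and the toy points are all assumptions made for the example.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance, the per-point term of the cost E."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Basic K-means; returns (assignment, centroids, cost)."""
    rng = random.Random(seed)
    # Iteration 0: starting centroids (given explicitly, or k random data points).
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda m: sq_dist(p, centroids[m])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the cost has stopped decreasing
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for m in range(k):
            members = [p for p, s in zip(points, assignment) if s == m]
            if members:  # leave a centroid in place if it attracted no points
                centroids[m] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centroids[s]) for p, s in zip(points, assignment))
    return assignment, centroids, cost


# Hypothetical toy data: three well-separated pairs of points.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
assignment, centroids, cost = kmeans(
    toy_points, k=3, init=[(0, 0), (10, 10), (20, 0)]
)
print(assignment, centroids, cost)
```

Passing `init` makes the run reproducible. Leaving it out picks random starting centroids, which can produce a different and sometimes worse partition, in line with the summary slide's point that the result depends on the initial centroid positions.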
