Download microarrays

Microarray Technology and Data Analysis (November 28, 2007)
Slides assembled by Dong-Guk Shin and J Peter Gogarten

Introduction to Microarray Technology

Two color microarrays:
• two conditions, two labels for cDNA
• develop slide with mRNAs
• hybridize mixture of both probes to printed glass slides
• make images, one for each probe
• fuse in computer

An alternative is to synthesize the DNA directly onto the matrix (slides from Affymetrix):
• created through photolithography
• on cell in array
• hybridization to labeled RNA from sample
• result of hybridization to array

Experimental design and sources for variation
[Diagram labels: Effect Size; Biological Variation; Array & Environment Variation]

Rules of thumb:
• Biological replicates are a must!
• As many biological replicates as you can afford!
• Cell population as homogeneous as possible!
[Diagram labels, continued: Sample Processing Variation; Technical Variation]
E.g.: Two mice in two different cages

“One characteristic common to all biological material is that it varies.” Finney, 1953

Control of Experiment Variance
“If I had to replicate my experiments, I could only do half as much.” Botstein, 1999

Technical Replication
• Technical variance → 0
• High-precision experiment
• Technical replication: estimation of technical variation
• Biological effect inaccurate

Biological Replication
• Biological variance → 0
• High-accuracy experiment
• Biological and technical variation are confounded
• Measurement precision decreased

Degree of Replication
• Robustness of the method → spot replication
• Dye swap → array replication
• Robustness of the biological assay
• Absolute transcript frequency/signal intensity → sample replication
• Relative transcript frequency associated with the biological effect → sample replication
• Cellular sample composition → sample replication

Statistical Analysis and Design
The number of independent data points is a function of the comparison design:

Single Color
• Post hoc comparison
  – Post Hoc Design
    – 2 data points/gene/condition

Two Color
• Direct comparison
  – Loop Design (Balanced)
    – biological and technical variation not confounded
    – 8 data points/gene/condition
• Indirect comparison
  – Reference Design (Unbalanced)
    – biological and technical variation not confounded
    – reference overrepresented
    – 4 data points/gene/condition

Pooling

[Figure] A reference design: the red and green arrows represent chips.
From: http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp

[Figure] A loop design: arrows represent chips with samples labeled as indicated.
[Figure] A saturated design w/o dye swap
[Figure] A design for a comparative study of the effect of a treatment on two biological strains with replicates and a few dye swaps
From: http://discover.nci.nih.gov/microarrayAnalysis/Experimental.Design.jsp

Topic 2
Data Preprocessing
• Background Correction
• Normalization

Background Correction
• None
  – DNA vs Substrate
  – No Imputation/Offset
• Local
  – Negative Signal Intensities likely
  – Imputation/Offset required
• Global
  – Negative Signal Intensities likely
  – Imputation/Offset required
• Moving Minimum
  – 3x3 spot average background
  – Negative Signal Intensities likely
  – Imputation/Offset required
• Edwards
  – log-linear interpolation of background intensities
  – Background Intensity insensitive
  – Test for Imputation
• Norm-Exp
  – regression based background estimation using Signal to Noise ratios
  – Background Intensity sensitive
  – No Imputation

Normalization
Background correction
Expression ratio: Ti = Ri/Gi
log2(ratio): log2(1) = 0, log2(2) = 1, log2(1/2) = -1, log2(4) = 2, log2(1/4) = -2

Total intensity normalization: If one has a large random sample of genes, most of which remain unchanged, one could normalize so that the mean ratio (T) for all spots is 1. (For the log2 Ti correction this corresponds to the subtraction of a constant. See http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf )

Two Color Analytical Plots
[Figure: comparison of Cond.1a, 1b, 1c with Cond.2a, 2b, 2c, shown as synthetic image, scatterplot, and ratio histogram]

Typical depiction: ratio versus intensity (log R + log G)
[Figure]
From: http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf

After locally weighted linear regression analysis:
[Figure]
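The log2 ratio and the total intensity normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intensities are made-up numbers, the function names are hypothetical, and the constant is taken to be log2 of the ratio of summed channel intensities, which is one common choice and not necessarily the one used for the figures in these slides.

```python
import math

def log2_ratios(red, green):
    """Return log2(Ri / Gi) for paired, background-corrected intensities."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def total_intensity_normalize(red, green):
    """Subtract one constant from every log2 ratio.

    Assumption: the constant is log2(sum(R) / sum(G)), i.e. both channels
    are scaled to the same total intensity.
    """
    offset = math.log2(sum(red) / sum(green))
    return [value - offset for value in log2_ratios(red, green)]

# Made-up intensities for four spots (illustrative only).
red = [200.0, 400.0, 800.0, 1200.0]
green = [100.0, 400.0, 400.0, 400.0]

raw = log2_ratios(red, green)
normalized = total_intensity_normalize(red, green)
print(raw)         # first three values: 1.0, 0.0, 1.0
print(normalized)  # every value shifted down by 1.0 (= log2(2600/1300))
```

Because the correction is a single subtraction in log space, it shifts the whole ratio-versus-intensity plot without changing its shape; intensity-dependent effects need a local method such as the locally weighted regression mentioned above.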
Source of the plot after locally weighted linear regression: http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf

Beware
Any data adjustment, however sophisticated or industrious, cannot convert low-quality data into high-quality data.
Data adjustment always removes a part of the biology.
!!Use it as sparingly as possible!!

Filtering Data
Outliers in the original data (in red) are excluded from the remainder of the data (blue) selected on the basis of a two-standard-deviation cut on the replicates.
[Figure]
From: http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf

Statistical Methods for Identifying Differentially Expressed Genes in Replicated Microarray Experiments

Gene Expression Data represented as N x M Matrix
[Figure: matrix with rows Gene 1, Gene 2, …, Gene N and columns Sample 1, Sample 2, …, Sample M; labels: Expression Signature, Expression Profile]
• N rows correspond to the N genes.
• M columns correspond to the M samples (microarray experiments).
• Each column = a sample or a replicate.
Example: Four replicate spots per array produce four columns of R/G ratios. If four replicate arrays are used, this produces a 16-column matrix, or 32 columns if the R and G values are stored separately.

Student’s Test Statistics
H0: The groups are not different
[Figure: distribution with regions labeled 68%, 95%, and 99% of all samples]
Naïve solution: do a t-test for each gene.
Multiplicity problem: the probability of error increases. (Bonferroni correction too conservative!)

Significance Analysis of Microarrays

Linear Models for Microarray Data
Package to analyze MA data. Good plot capabilities.
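To make the naive per-gene t-test and the multiplicity problem noted above more concrete, here is a small Python sketch. It is an illustration only: the function names and the numbers are hypothetical, the t statistic is the classical pooled-variance two-sample form, and the family-wise error formula assumes the tests are independent, which real gene expression data generally are not.

```python
import math
import statistics

def t_statistic(group_a, group_b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / std_err

def familywise_error(alpha, n_tests):
    """Chance of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

# One hypothetical gene measured in three control and three treated samples.
t_value = t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t_value, 4))                  # -3.6742

# Testing 10,000 genes at alpha = 0.05 makes a false positive almost certain,
# while the Bonferroni threshold becomes very strict.
print(familywise_error(0.05, 10000))
print(bonferroni_threshold(0.05, 10000))
```

The sketch stops at the test statistic because converting it to a p-value needs the t distribution, which the Python standard library does not provide.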
Semi-parametric hierarchical (SPH) mixture model

Significance Analysis of Microarrays (SAM) uses balanced permutations (sample versus control intensities “re-labeling”) to generate an expectation for the comparison.

Volcano plots compare significance (y-axis) against effect (x-axis)
• The plot compares significance determinations obtained with MAANOVA (MicroArray ANalysis Of VAriance).
• On the plot, the y-axis value is -log10(P-value) for the F1 test. The x-axis value is proportional to the fold changes.
• A horizontal line represents the significance threshold of the F1 test.
• Blue dots: EE genes
• Green dots: F3
• Orange dots: Fs
• F2 (In example graph, F2 tests were not run.)

Microarray Data: Clustering

Clustering
Assign n similar objects to groups.
Example: green/red data points were generated from two different normal distributions.

Why cluster genes?
• Identify groups of possibly co-regulated genes
• Identify temporal or spatial gene expression patterns

Why cluster experiments/samples?
• Detect experimental artifacts/bad hybridizations
• Identify new classes of biological samples (e.g. tumor subtypes)

To Do Clustering You Need …
Distance measure (Example: Intra-Cluster Distances for hierarchical clustering)
gene expression # 1: x = (x1, …, xn), gene expression # 2: y = (y1, …, yn)
• Euclidean: $d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Manhattan: $d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Correlation: $d_C(x, y) = 1 - \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
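The three distance measures listed above translate directly into code. The following Python sketch uses hypothetical function names and made-up expression vectors; it assumes both vectors have the same length and, for the correlation distance, that neither vector is constant.

```python
import math

def euclidean(x, y):
    """d_E: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """d_M: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def correlation_distance(x, y):
    """d_C: one minus the Pearson correlation of x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                            * sum((b - mean_y) ** 2 for b in y))
    return 1 - numerator / denominator

# Made-up expression vectors: y follows the same trend as x at twice the
# amplitude, z follows the opposite trend.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
z = [3.0, 2.0, 1.0]

print(manhattan(x, y))             # 6.0
print(correlation_distance(x, y))  # 0.0 (same shape, different scale)
print(correlation_distance(x, z))  # 2.0 (opposite shape)
```

The example shows why the choice matters: x and y are far apart by Euclidean or Manhattan distance but identical by correlation distance, so correlation groups genes by the shape of their profiles and ignores magnitude.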
To Do Clustering You Also Need …
Cluster Algorithm/Method
(1) Hierarchical
(2) Parametric (Partitioning)

Basic Idea
• small within-cluster distances
• large between-cluster distances

Hierarchical Clustering
[Figure: dendrogram of objects 1 to 5. Agglomerative clustering builds upward from single objects (1,5; then 1,2,5 and 3,4; then 1,2,3,4,5); divisive clustering proceeds in the opposite direction.]

Hierarchical clustering
Clustered display of data from time course of serum stimulation of primary human fibroblasts (grown in culture and deprived of serum for 48 hr; serum was added back and samples taken at time 0, 15 min, 30 min, 1 hr, 2 hr, 3 hr, 4 hr, 8 hr, 12 hr, 16 hr, 20 hr, 24 hr). All measurements are relative to time 0. Genes were selected for this analysis if their expression level deviated from time 0 by at least a factor of 3.0 in at least 2 time points. Each gene is represented by a single row of colored boxes; each time point is represented by a single column. Labeled clusters contain multiple genes involved in (A) cholesterol biosynthesis, (B) the cell cycle, (C) the immediate-early response, (D) signaling and angiogenesis, and (E) wound healing and tissue remodeling. These clusters also contain named genes not involved in these processes and numerous uncharacterized genes.
Eisen, Michael B. et al. (1998) Proc. Natl. Acad. Sci. USA 95, 14863-14868
Copyright ©1998 by the National Academy of Sciences

Volcano Plot and Heatmaps
[Figures]
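The agglomerative (bottom-up) form of hierarchical clustering described earlier in this section can be sketched as follows. This is a minimal illustration, not the method behind the published figure: it assumes single linkage (cluster distance = smallest pairwise distance), the function name is hypothetical, and the five one-dimensional values are invented so that the merges resemble the 1 to 5 example.

```python
def single_linkage(values, distance):
    """Agglomerative clustering; returns the merged clusters in order.

    values maps an object label to its measurement. Linkage rule
    (assumption): distance between two clusters is the smallest
    distance between any pair of their members.
    """
    clusters = [(label,) for label in values]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters that are closest to each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(values[a], values[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = tuple(sorted(clusters[i] + clusters[j]))
        merges.append(merged)
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Invented measurements for objects 1 to 5.
profiles = {1: 1.0, 2: 2.5, 3: 6.0, 4: 7.2, 5: 1.4}
merges = single_linkage(profiles, lambda a, b: abs(a - b))
print(merges)  # [(1, 5), (1, 2, 5), (3, 4), (1, 2, 3, 4, 5)]
```

Each merge is final, which is the rigidity listed among the disadvantages of hierarchical methods: an early poor merge cannot be undone later.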
Parametric Clustering (partitioning)
• K-Means
• K-Medoids (PAM)
• SOM
• Fuzzy-C Means

Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – Global optimal: exhaustively enumerate all partitions
  – Heuristic methods: k-means and k-medoids algorithms
  – k-means (MacQueen’67): Each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
  – Partition objects into k nonempty subsets.
  – Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  – Assign each object to the cluster with the nearest seed point.
  – Go back to Step 2; stop when there are no more new assignments.

Parametric or Hierarchical (Non-Parametric)?

Parametric
Advantages:
• Optimal for certain criteria.
• Genes automatically assigned to clusters.
Disadvantages:
• Need initial k.
• Often require long computation times.
• Every gene is assigned to a cluster.

Hierarchical
Advantages:
• Faster computation.
• Visual representation.
Disadvantages:
• Unrelated genes are eventually joined.
• Rigid, cannot correct later for erroneous decisions made earlier.
• Hard to define clusters.

Meta Analyses of MA data:
• GO Analysis
• Pathway Analysis
[Figure]

To do:
For Friday
– Read chapter 18
– Browse through http://jura.wi.mit.edu/bio/education/bioinfo/lecture10-color.pdf and
– http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v32/n4s/full/ng1032.html&filetype=pdf
For Monday:
– Refresh your memory on McRobot and Bayesian analyses
– Go through quiz 8 (will be posted Friday/Saturday, due following Wednesday)
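As a supplement to the four k-means steps listed earlier in this section, here is a minimal Python sketch. Several details are assumptions made for illustration: the initial partition is a simple round-robin assignment, distance is Euclidean, an emptied cluster keeps its previous centroid, and the function names and data points are invented. It requires Python 3.8 or later for `math.dist`.

```python
import math

def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    """Return (labels, centers) following the four steps on the slide."""
    # Step 1: partition objects into k nonempty subsets (round-robin here).
    labels = [i % k for i in range(len(points))]
    centers = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the partition.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its old centroid
                centers[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        # Step 4: stop when no assignment changes, otherwise repeat.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Invented two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0),
          (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
labels, centers = k_means(points, k=2)
print(labels)  # [0, 0, 1, 1, 0, 1]
```

The result depends on the starting partition and on k, which reflects the "need initial k" disadvantage listed for parametric methods; in practice the algorithm is usually run from several different starts.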
