Gene expression: Microarray data analysis

Compare gene expression in this cell type…
…after viral infection
…relative to a knockout
…in samples from patients
…after drug treatment
…at a later developmental time
…in a different body region

Gene expression is context-dependent, and is regulated in several basic ways
• by region (e.g. brain versus kidney)
• in development (e.g. fetal versus adult tissue)
• in dynamic response to environmental signals (e.g. immediate-early response genes)
• in disease states
• by gene activity

[Figure: DNA, RNA, and protein; RNA is converted to cDNA and measured with UniGene, SAGE, microarrays, or next-generation sequencing]

UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

Microarrays: tools for gene expression
A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array.

The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes.

Advantages of microarray experiments
Fast: data on >20,000 transcripts in ~2 weeks
Comprehensive: entire yeast or mouse genome on a chip
Flexible: custom arrays can be made to represent genes of interest
Easy: submit RNA samples to a core facility
Cheap:
Chip representing 20,000 genes for $300

Disadvantages of microarray experiments
Cost
■ Some researchers can’t afford to do appropriate numbers of controls, replicates (x100)
RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet represented on microarrays
Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians

[Figure: workflow from sample acquisition to data acquisition, data analysis, data confirmation, and biological insight]

Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays
Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Microarray databases

Stage 1: Experimental design
[1] Biological samples: technical and biological replicates; determine the data analysis approach at the outset
[2] RNA extraction, conversion, labeling, hybridization: except for RNA isolation, routinely performed at core facilities
[3] Arrangement of array elements on a surface: randomization can reduce spatially-based artifacts

One sample per array (e.g. Affymetrix or radioactivity-based platforms)
• Sample 1
• Sample 2
• Sample 3

Two samples per array (competitive hybridization)
• Samples 1,2
• Samples 1,3
• Samples 2,3
• Sample 1, pool
• Sample 2, pool
• Samples 2,1: switch dyes

Stage 2: RNA preparation
• For Affymetrix chips, total RNA is needed (about 5 µg)
• Confirm purity by running an agarose gel
• Measure A260/A280 to confirm purity and quantity

One of the greatest sources of error in microarray experiments is artifacts associated with RNA isolation; be sure to create an appropriately balanced, randomized experimental design.
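The advice above to use a balanced, randomized design can be illustrated with a minimal sketch. It assumes that RNA is isolated in a small number of processing batches; the function name `assign_batches`, the sample labels, and the batch count are hypothetical and chosen only for illustration.

```python
import random

def assign_batches(samples_by_group, n_batches, seed=0):
    """Spread each group's samples evenly over processing batches, in random order."""
    rng = random.Random(seed)  # fixed seed so the design can be reproduced
    batches = [[] for _ in range(n_batches)]
    for group, samples in samples_by_group.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        # Deal the shuffled samples round-robin so each batch gets a share of every group.
        for i, sample in enumerate(shuffled):
            batches[i % n_batches].append((group, sample))
    for batch in batches:
        rng.shuffle(batch)  # randomize the processing order within each batch
    return batches

# Hypothetical example: 4 control and 4 disease samples, 2 RNA isolation batches.
design = assign_batches(
    {"control": ["C1", "C2", "C3", "C4"], "disease": ["D1", "D2", "D3", "D4"]},
    n_batches=2,
)
for number, batch in enumerate(design, start=1):
    print("batch", number, batch)
```

Because every batch contains the same number of control and disease samples, a batch effect in RNA isolation is less likely to be mistaken for a disease effect.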
Stage 3: Hybridization to DNA arrays
• The array consists of cDNA or oligonucleotides
• Oligonucleotides can be deposited by photolithography
• The sample is converted to cRNA or cDNA
(Note that the terms “probe” and “target” may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.)

Microarrays: array surface
[Figure: array surface; Southern et al. (1999) Nature Genetics, microarray supplement]

Stage 4: Image analysis
• RNA transcript levels are quantitated
• Fluorescence intensity is measured with a scanner, or radioactivity with a phosphorimager

Differential Gene Expression on a cDNA Microarray
[Figure: control versus Rett syndrome arrays. α B Crystallin is over-expressed in Rett Syndrome]

Stage 5: Microarray data analysis
Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?
Clustering
• Are there meaningful patterns in the data (e.g. groups)?
Classification
• Do RNA transcripts predict predefined groups, such as disease subtypes?

Stage 6: Biological confirmation
Microarray experiments can be thought of as “hypothesis-generating” experiments. The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as
-- Northern blots
-- polymerase chain reaction (RT-PCR)
-- in situ hybridization

Stage 7: Microarray databases
There are two main repositories:
• Gene Expression Omnibus (GEO) at NCBI
• ArrayExpress at the European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/arrayexpress/

MIAME
In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME).
The MIAME framework standardizes six areas of information:
►experimental design
►microarray design
►sample preparation
►hybridization procedures
►image analysis
►controls for normalization
Visit http://www.mged.org

Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
[Figure: data matrix with genes (RNA transcript levels) as rows and samples as columns]
Typically, there are many genes (>> 20,000) and few samples (~ 10)
The analysis has three parts:
• Preprocessing
• Inferential statistics
• Descriptive statistics

Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency

The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.
A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.

Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.
Data analysis: global normalization (example)
Example: total fluorescence in the Cy3 channel = 4 million units; Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.

Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)

Scatter plots
• Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
• Each dot corresponds to a gene expression value
• Most dots fall along a line
• Outliers represent up-regulated or down-regulated genes

Differential Gene Expression in Different Tissue and Cell Types
[Figure: scatter plots comparing fibroblast, brain, and astrocyte samples; x-axis: expression level (sample 1); y-axis: expression level (sample 2), low to high]

Log-log transformation
Typically, data are plotted on log-log coordinates. Visually, this spreads out the data and offers symmetry.

time    behavior      raw ratio value    log2 ratio value
t=0     basal         1.0                 0.0
t=1h    no change     1.0                 0.0
t=2h    2-fold up     2.0                 1.0
t=3h    2-fold down   0.5                -1.0

[Figure: x-axis: mean log intensity (expression level, low to high); y-axis: log ratio (up or down)]

You can make these plots in Excel…
…but for many bioinformatics applications use R. Visit http://www.r-project.org to download it.
See chapter 9 (2nd edition) for a tutorial on microarray data analysis.

[Figure: MA plots (M versus A) before and after RMA]
After RMA (a normalization procedure), the median is near zero, and skewing is corrected.
Scatterplots display the effects of normalization.
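The global normalization procedure and the log2 transformation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any particular software package: the function name `global_normalize` is hypothetical, the intensities are invented, and the scaling factor is taken to be the ratio of total channel intensities, as in the 4 million versus 2 million example.

```python
import math

def global_normalize(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Return log2(Cy3/Cy5) per gene after background subtraction and global scaling."""
    # Step 1: subtract background; the small floor (an assumption) avoids log of zero.
    cy3_corr = [max(v - background_cy3, 1e-9) for v in cy3]
    cy5_corr = [max(v - background_cy5, 1e-9) for v in cy5]
    # Step 2: rescale Cy5 so that both channels have the same total intensity.
    scale = sum(cy3_corr) / sum(cy5_corr)
    return [math.log2(g / (r * scale)) for g, r in zip(cy3_corr, cy5_corr)]

# Invented intensities: the Cy3 channel is twice as bright overall as the Cy5 channel.
cy3 = [2000.0, 4000.0, 1000.0, 1000.0]
cy5 = [1000.0, 2000.0, 250.0, 750.0]

uncorrected = [math.log2(g / r) for g, r in zip(cy3, cy5)]
corrected = global_normalize(cy3, cy5)
print("uncorrected log2 ratios:", uncorrected)
print("corrected log2 ratios:  ", corrected)
```

In this toy data set the first gene shows an uncorrected log2 ratio of 1.0 (an apparent 2-fold change) that becomes 0.0 after normalization, while the third gene keeps a log2 ratio of 1.0 after correction.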
SNOMAD converts array data to scatter plots
http://www.snomad.org
[Figure: linear-linear plot and log-log plot of EXP versus CON, and a plot of Log10(Ratio) versus Mean(Log10(Intensity)) with 2-fold lines separating EXP > CON from EXP < CON]

SNOMAD corrects local variance artifacts
[Figure: Log10(Ratio) versus Mean(Log10(Intensity)) with a robust local regression fit, and the Corrected Log10(Ratio) [residuals] versus Mean(Log10(Intensity))]

SNOMAD describes regulated genes in Z-scores
[Figure: Corrected Log10(Ratio) versus Mean(Log10(Intensity)) with locally estimated standard deviations of positive and negative ratios, and Local Log10(Ratio) Z-Score versus Mean(Log10(Intensity)) with lines at Z = ±1, ±2, and ±5]

Robust multi-array analysis (RMA)
• Developed by Rafael Irizarry, Terry Speed, and others
• Available at www.bioconductor.org as an R package
• Also available in various software packages (including Partek, www.partek.com and Iobion Gene Traffic)
• See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4
There are three steps:
[1] Background adjustment based on a normal plus exponential model (no mismatch data are used)
[2] Quantile normalization (nonparametric fitting of signal intensity data to normalize their distribution)
[3] Fitting a log scale additive model robustly. The model is additive: probe effect + sample effect

[Figure: histograms of log signal intensity]
Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied.
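Step [2] of RMA, quantile normalization, can be illustrated with a minimal sketch. It is not the Bioconductor implementation: the function name `quantile_normalize` and the intensities are hypothetical, ties are not given special treatment, and the background adjustment and model fitting of steps [1] and [3] are omitted.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list of probe intensities per array."""
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across all arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this array.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]  # replace each value by the reference at its rank
        normalized.append(out)
    return normalized

# Invented intensities for 4 probes on 3 arrays.
raw = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
norm = quantile_normalize(raw)
for array in norm:
    print(array)
```

After this step every array has exactly the same set of intensity values, and only the assignment of values to probes differs between arrays, which is why the histograms of the arrays coincide after normalization.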
[Figure: log signal intensity for each of arrays 1 to 14]

RMA adjusts for the effect of GC content
[Figure: log intensity versus GC content]

Precision and accuracy
• Precision: reproducibility of the result
• Accuracy: good quality of the result (relative to a gold standard)
• Good performance: precision with accuracy

Robust multi-array analysis (RMA)
RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software).
[Figure: precision; log expression SD versus average log expression for MAS 5.0 and RMA]
RMA offers comparable accuracy to MAS 5.0.
[Figure: accuracy; observed log expression versus log nominal concentration]

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to 0.05 and reject the null hypothesis when p < 0.05.

Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}}$$

Questions:
• Is the sample size (n) adequate?
• Are the data normally distributed?
• Is the variance of the data known?
• Is the variance the same in the two groups?
• Is it appropriate to set the significance level to p < 0.05?
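As a minimal sketch of the t statistic for a single transcript, the following computes the difference between group means divided by the standard error of the difference. It assumes equal variances in the two groups (Student's two-sample t-test); the function name `two_sample_t` and the expression values are invented for illustration, and the P value is left to a table or a statistics package.

```python
import math
from statistics import mean, variance

def two_sample_t(group1, group2):
    """Student's two-sample t statistic (equal variances assumed) and degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se, df

# Invented expression values for one transcript: three controls, three disease samples.
control = [10.0, 11.0, 12.0]
disease = [14.0, 15.0, 16.0]
t, df = two_sample_t(control, disease)
print("t =", round(t, 3), "with", df, "degrees of freedom")
# |t| is then compared with the tabulated critical value
# (2.776 for a two-tailed test at alpha = 0.05 with 4 degrees of freedom).
```

The sign of t only reflects which group is listed first; for a two-tailed test the absolute value is what is compared with the critical value.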
Inferential statistics: notes on the t-test

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}}$$

Notes
• t is a ratio (it thus has no units)
• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t-test, the degrees of freedom is N - 2. For any value of t, P gets smaller as df gets larger

Analyzing expression data in Excel
Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease.

[1] Obtain a matrix of genes (rows) and expression values (columns). Here there are 20,000 rows of genes, of which the first six are shown. There are three control samples and three disease samples. You can also calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.

[2] You can calculate the ratios of control versus disease. Note that you can use the formula =E5/I5 in this case to divide the mean control and disease values. Also note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cutoff for changes of interest, such as two-fold.

[3] Perform a t-test. When you enter =TTEST into the function box above, a dialog box appears. Enter the range of values for controls and for disease samples, and specify a 1- or 2-tailed test.

For a one-tailed test, your prior hypothesis is that the transcript in the disease group is up (or down) relative to controls; the change is unidirectional. For example, in Down syndrome samples you might hypothesize that chromosome 21 transcripts are significantly up-regulated because of the extra copy of chromosome 21.
For a two-tailed test, you hypothesize that the two groups are different, but you do not know in which direction.

Analyzing expression data in Excel
[3] Note the results: you can have…
• a small p value (<0.05) with a big ratio difference
• a small p value (<0.05) with a trivial ratio difference
• a large p value (>0.05) with a big ratio difference
• a large p value (>0.05) with a trivial ratio difference
Only the first group is worth reporting! Why?

t-test to determine statistical significance
[Figure: variability partitioned into disease vs normal and error]

$$t\ \text{statistic} = \frac{\text{difference between mean of disease and normal}}{\text{variation due to error}}$$

Inferential statistics

Paradigm                       Parametric test    Nonparametric
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA

ANOVA partitions total data variability
[Figure: before partitioning: disease vs normal, error; after partitioning: subject, disease vs normal, tissue type, error]

$$F\ \text{ratio} = \frac{\text{variation between DS and normal}}{\text{variation due to error}}$$

Descriptive statistics
Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples.
Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation

What is a cluster?
A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.

[Figure: data matrix of genes (rows) by samples (time points; columns): 20 genes and 3 time points from Chu et al., 1998]
[Figure: 3D plot of the genes at t=0, t=0.5, and t=2.0 (using S-PLUS software)]

Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions of microarray data.
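The two distance metrics named above, Euclidean distance and the Pearson coefficient of correlation, can be compared with a minimal sketch. The expression profiles are invented, and the function names are hypothetical; a correlation r is often turned into a dissimilarity as 1 - r before clustering, which is an assumption of this example and not something the slides specify.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Invented profiles over three time points.
gene_a = [1.0, 2.0, 3.0]     # rises over time, low abundance
gene_b = [10.0, 20.0, 30.0]  # same shape as gene_a, ten times more abundant
gene_c = [3.0, 2.0, 1.0]     # falls over time, low abundance

for name, other in (("b", gene_b), ("c", gene_c)):
    d = euclidean_distance(gene_a, other)
    r = pearson_correlation(gene_a, other)
    print("a vs", name, "Euclidean:", round(d, 2), "Pearson r:", round(r, 2))
```

Here gene_a is closer to gene_c by Euclidean distance but perfectly correlated with gene_b, so the choice of metric determines whether genes group by abundance or by the shape of their expression profile.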
Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes.

Agglomerative clustering
[Figure, adapted from Kaufman and Rousseeuw (1990): objects a, b, c, d, e are joined over steps 0 to 4. First a and b merge into a,b; then d and e merge into d,e; then c joins d,e to form c,d,e; finally a,b and c,d,e merge into a,b,c,d,e.]
…tree is constructed

Divisive clustering
[Figure, adapted from Kaufman and Rousseeuw (1990): the same tree read in the opposite direction, from step 4 to step 0. The full set a,b,c,d,e is split into a,b and c,d,e; c,d,e is split into c and d,e; and the remaining groups are split until a, b, c, d, e are single objects.]
…tree is constructed

Agglomerative and divisive clustering sometimes give conflicting results, as shown here
[Figure: two trees for the same data set]

Cluster and TreeView
[Figure: Cluster and TreeView software with options for clustering, K means, SOM, and PCA]
Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)

Principal components analysis (PCA)
An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D.
For a matrix of m genes x n samples, create a new covariance matrix of size n x n.
Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).

Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combination of variables
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g. genes/samples)
• to identify outliers
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

Copyright notice
Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.
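Appendix-style sketch for the PCA slides above: for a matrix of m genes x n samples, the code builds the n x n covariance matrix and approximates the first principal component. Power iteration is used here only because it needs no external libraries; it is an illustrative choice, not the method used by the software named in the slides, and the function names and data are hypothetical.

```python
import math

def covariance_matrix(data):
    """data: m rows (genes) x n columns (samples). Returns (n x n covariance, centered data)."""
    m, n = len(data), len(data[0])
    means = [sum(row[j] for row in data) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in data]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
        for i in range(n)
    ]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Power iteration: approximates the eigenvector of cov with the largest eigenvalue."""
    n = len(cov)
    vec = [1.0 / math.sqrt(n)] * n  # starting guess (assumed not orthogonal to PC1)
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    # Rayleigh quotient gives the variance captured along this component.
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return vec, eigenvalue

# Invented data: 4 genes x 2 samples, where sample 2 is always twice sample 1.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov, centered = covariance_matrix(data)
pc1, variance_pc1 = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
scores = [sum(v * w for v, w in zip(row, pc1)) for row in centered]
print("PC1 direction:", pc1)
print("fraction of variance on PC1:", variance_pc1 / total_variance)
print("gene scores on PC1:", scores)
```

Because the two samples in this toy matrix are perfectly correlated, a single principal component captures essentially all of the variance, which is the sense in which PCA reduces dimensionality.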