Gene Expression - Microarrays
Misha Kapushesky
European Bioinformatics Institute, EMBL
St. Petersburg, Russia
May 2010

Compare gene expression in this cell type…
• …after viral infection
• …relative to a knockout
• …in samples from patients
• …after drug treatment
• …at a later developmental time
• …in a different body region

Gene expression is context-dependent, and is regulated in several basic ways
• by region (e.g. brain versus kidney)
• in development (e.g. fetal versus adult tissue)
• in dynamic response to environmental signals (e.g. immediate-early response genes)
• in disease states
• by gene activity
(Page 297)

Outline: microarray data analysis
Gene expression
Microarrays
Preprocessing
  normalization
  scatter plots
Inferential statistics
  t-test
  ANOVA
Exploratory (descriptive) statistics
  distances
  clustering
  principal components analysis (PCA)

Microarrays: tools for gene expression
A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array. (Page 312)
The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes.
Measuring RNA abundances

How it works
Complementary hybridization:
- Put a part of the gene sequence on the array
- Convert mRNA to cDNA using reverse transcriptase

Spotted Arrays
• A robot puts little spots of DNA on glass slides
• Each spot is a DNA analog of the mRNA we want to detect
• Two-channel technology for comparing two samples: relative measurements
• Two mRNA samples (reference, test) are reverse transcribed to cDNA, labeled with fluorescent dyes (Cy3, Cy5) and allowed to hybridize to the array
• Read out two images by scanning the array with lasers, one for each dye

Oligonucleotide Arrays
• One-channel technology: absolute measurements
• Instead of putting entire genes on the array, put multiple oligonucleotide probes: short, fixed-length DNA sequences (25-60 nucleotides)
• Oligos are synthesized in situ
• Affymetrix uses a photolithography process, similar to that used to make semiconductor chips
• Other technologies are available (e.g. mirror arrays)
• For each gene, construct a probeset: a set of n-mers specific to this gene

Advantages of microarray experiments
Fast: data on >20,000 transcripts within weeks
Comprehensive: entire yeast or mouse genome on a chip
Flexible: custom arrays can be made to represent genes of interest
Easy: submit RNA samples to a core facility
Cheap?
Chip representing 20,000 genes for $300

Disadvantages of microarray experiments
Cost
■ Some researchers can’t afford to do appropriate numbers of controls, replicates
RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet represented on microarrays
Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians

Sample acquisition
Data acquisition
Data analysis
Data confirmation
Biological insight

Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays
Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Microarray databases

Stage 1: Experimental design
[1] Biological samples: technical and biological replicates; determine the data analysis approach at the outset
[2] RNA extraction, conversion, labeling, hybridization: except for RNA isolation, routinely performed at core facilities
[3] Arrangement of array elements on a surface: randomization can reduce spatially-based artifacts
(Page 314)

Stage 2: RNA preparation
For Affymetrix chips, need total RNA (about 5 µg)
Confirm purity by running an agarose gel
Measure A260/A280 to confirm purity, quantity
One of the greatest sources of error in microarray experiments is artifacts associated with RNA isolation; appropriately balanced, randomized experimental design is necessary.

Stage 3: Hybridization to DNA arrays
The array consists of cDNA or oligonucleotides
Oligonucleotides can be deposited by photolithography
The sample is converted to cRNA or cDNA
(Note that the terms “probe” and “target” may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.)
Stage 4: Image analysis
RNA transcript levels are quantitated
Fluorescence intensity is measured with a scanner.

[Figure: Differential gene expression on a cDNA microarray, control versus Rett. αB-crystallin is over-expressed in Rett syndrome. Fig. 8.21, page 319]

Stage 5: Microarray data analysis
Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?
Clustering
• Are there meaningful patterns in the data (e.g. groups)?
Classification
• Do RNA transcripts predict predefined groups, such as disease subtypes?
(Page 318)

Stage 6: Biological confirmation
Microarray experiments can be thought of as “hypothesis-generating” experiments.
The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as
-- Northern blots
-- reverse transcription polymerase chain reaction (RT-PCR)
-- in situ hybridization
(Page 320)

Stage 7: Microarray databases
There are two main repositories:
Gene Expression Omnibus (GEO) at NCBI
ArrayExpress at the European Bioinformatics Institute (EBI)

[Figure: Microarray Overview I. Microbial ORFs (design PCR primers) or eukaryotic genes (select cDNA clones) yield PCR products in microtiter plates, which are spotted onto a microarray slide (with 60,000 or more spotted genes). Many different plates contain different genes; for each plate set, many identical replicas.]

[Figure: Microarray Overview II. Prepare fluorescently labeled probes from control and test samples; hybridize, wash; measure fluorescence in 2 channels (red/green); analyze the data to identify patterns of gene expression.]

[Figure: Affymetrix GeneChip™ Expression Analysis. Obtain RNA samples (control, test); prepare fluorescently labeled probes; hybridize and wash chips; scan chips; analyze. PM and MM probes are shown.]

[Figure: Microarray Expression Analysis. Tissue selection; differential state/stage selection; RNA preparation and labeling; competitive hybridization; gene spots on an array; fluorescence intensity; expression measurement.]

Steps in the Process
Select
array elements and annotate them
Build a database to manage stuff
Print arrays and manage the lab
Hybridize and analyze images; manage data
Analyze hybridization data and get results

MIAME
In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information:
► experimental design
► microarray design
► sample preparation
► hybridization procedures
► image analysis
► controls for normalization
Visit http://www.mged.org

Interpretation of RNA analyses
The relationship of DNA, RNA, and protein: DNA is transcribed to RNA. RNA quantities and half-lives vary. There tends to be a low positive correlation between RNA and protein levels.
The pervasive nature of transcription: The Encyclopedia of DNA Elements (ENCODE) project identified functional features of genomic DNA, initially in 30 megabases (1% of the human genome). One of its observations was the “pervasive nature of transcription”: the vast majority of DNA is transcribed, although the function is unknown.

Microarray data analysis
• begin with a data matrix (gene expression values versus samples): rows are genes (RNA transcript levels), columns are samples
Typically, there are many genes (>> 20,000) and few samples (~ 10)
Fig.
9.1 Page 333

Microarray data analysis
• begin with a data matrix (gene expression values versus samples)
Preprocessing
Inferential statistics
Descriptive statistics

Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.
A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.

Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.
Example: total fluorescence in the Cy3 channel = 4 million units and in the Cy5 channel = 2 million units. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)

Scatter plots
Useful to represent gene expression values from two microarray experiments (e.g.
control, experimental)
Each dot corresponds to a gene expression value
Most dots fall along a line
Outliers represent up-regulated or down-regulated genes

[Figure: Differential gene expression in different tissue and cell types (fibroblast, brain, astrocyte). Scatter plot of expression level (sample 2) against expression level (sample 1), each running from low to high, after log-log transformation.]

Scatter plots
Typically, data are plotted on log-log coordinates
Visually, this spreads out the data and offers symmetry

time    behavior      raw ratio value   log2 ratio value
t=0     basal         1.0                0.0
t=1h    no change     1.0                0.0
t=2h    2-fold up     2.0                1.0
t=3h    2-fold down   0.5               -1.0

[Figure: log ratio (up, down) plotted against mean log intensity (expression level, low to high).]

You can make these plots in Excel, but for many bioinformatics applications use R. Visit http://www.r-project.org to download it.

There are limits to what you can measure
The Limits of log-ratios: The space we explore
[Figures: "Good Data" and "Bad Data from Parts Unknown"; each “pin group” is colored differently. Credit: Gary Churchill]

Lowess Normalization
Why LOWESS?
1. Intensity-dependent structure
2. Data not mean centered at log2(ratio) = 0
[Figure, panel A: ratio Cy3/Cy5 for the same RNA, sorted from least to most expressed; SD = 0.346]
LOWESS Results

Affymetrix Chips
Mismatch (MM) probes
• MM probes are used to measure background signals due to non-specific sources and scanner offset.
• Using a MM probe as an estimate of background seems wrong, and often the MM signal >= the PM signal
• Some would claim that subtraction of the mismatch probe adds noise for little gain.

Computing expression summaries: a three-step process
• Background/Signal adjustment
• Normalization (can happen at the probe-pair or the probe-set level).
• Summarization of probe-pairs into probe-set or gene level information

Background/Signal Adjustment
• A method which does some or all of the following:
  Corrects for background noise, processing effects
  Adjusts for cross hybridization
  Adjusts estimated expression values to fall on the proper scale
• Probe intensities are used in background adjustment to compute the correction (unlike cDNA arrays, where the area surrounding a spot might be used)

Normalization Methods
• Complete data (no reference chip, information from all arrays used)
  Quantile normalization (Bolstad et al. 2003)
• Baseline (normalized using a reference chip)
  Scaling (Affymetrix)
  Non-linear (Li-Wong)

Summarization
• Reduce the 11-20 probe intensities on each array to a single number for gene expression
• Main approaches
  Single chip
  • AvDiff (Affymetrix): no longer recommended for use due to many flaws
  • MAS 5.0 (Affymetrix): uses a 1-step Tukey biweight to combine the probe intensities in log scale
  Multiple chip
  • MBEI (Li-Wong dChip): a multiplicative model
  • RMA: a robust multi-chip linear model fit on the log scale

Robust multi-array analysis (RMA)
• Developed by Rafael Irizarry (Dept. of Biostatistics), Terry Speed, and others
• Available at www.bioconductor.org as an R package
• Also available in various software packages (including Partek, www.partek.com and Iobion Gene Traffic)
• See Bolstad et al. (2003) Bioinformatics 19; Irizarry et al. (2003) Biostatistics 4
There are three steps:
[1] Background adjustment based on a normal plus exponential model (no mismatch data are used)
[2] Quantile normalization (nonparametric fitting of signal intensity data to normalize their distribution)
[3] Fitting a log scale additive model robustly.
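
Step [2] above, quantile normalization, can be illustrated with a small standard-library Python sketch. It is not the Bioconductor implementation: the function name `quantile_normalize`, the toy intensities, and the tie handling (ties broken by row order) are assumptions made only for illustration. The idea shown is that every array is forced onto the same reference distribution, taken here as the mean intensity at each rank across arrays.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (arrays) of a genes x arrays matrix.

    Illustrative sketch only: ties are broken by row order, which is a
    simplification compared with production implementations.
    """
    n_genes = len(matrix)
    n_arrays = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_arrays)]
    # For each array, the gene (row) indices sorted by increasing intensity.
    orders = [sorted(range(n_genes), key=col.__getitem__) for col in columns]
    # Reference distribution: mean intensity at each rank across all arrays.
    reference = [
        sum(columns[j][orders[j][rank]] for j in range(n_arrays)) / n_arrays
        for rank in range(n_genes)
    ]
    # Give each gene the reference value that corresponds to its rank.
    normalized = [[0.0] * n_arrays for _ in range(n_genes)]
    for j, order in enumerate(orders):
        for rank, gene in enumerate(order):
            normalized[gene][j] = reference[rank]
    return normalized


# Hypothetical intensities: 4 genes (rows) measured on 3 arrays (columns).
raw = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(raw)
```

After the call, every column of `normalized` contains the same set of values, so the arrays share one intensity distribution while each gene keeps its rank within its own array.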
The model is additive: probe effect + sample effect

GCRMA
• GC-RMA is a modified version of RMA that models intensity of probe level data as a function of GC-content
• Expect to see higher intensity values for probes that are GC rich due to increased binding

[Figure: M versus A plots.] After RMA (a normalization procedure), the median is near zero, and skewing is corrected. Scatterplots display the effects of normalization.

vsn: variance stabilizing normalization
• Variance depends on signal intensity in microarray data
• A transformation can be found after which the variance is approximately constant
• Like the logarithm at the upper end of the intensity scale, approximately linear at the lower end
• Also incorporates the estimation of "normalization" parameters (shift and scale)
• Assumes that less than half of the genes on the arrays are differentially transcribed across the experiment.
vsn: post-normalization plot

[Figure: Histograms of raw intensity values for 14 arrays (plotted in R) before and after RMA was applied. Axes: log signal intensity by array.]

[Figure: RMA can adjust for the effect of GC content. Axes: log intensity versus GC content.]

Robust multi-array analysis (RMA)
RMA offers a large increase in precision (relative to Affymetrix MAS 5.0 software).
[Figure: precision. Log expression SD versus average log expression for MAS 5.0 and RMA.]
RMA offers comparable accuracy to MAS 5.0.
[Figure: accuracy. Observed log expression versus log nominal concentration.]

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics.
A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether or not to reject the null hypothesis. For many applications, we set the significance level α to p < 0.05.

Analyzing expression data
Question: for each of my 20,000 transcripts, decide whether it is significantly regulated in some disease.
[1] Obtain a matrix of genes (rows) and expression values (columns). Here there are 20,000 rows of genes, of which the first six are shown. There are three control samples and three disease samples. Calculate the mean value for each gene (transcript) for the controls and the disease (experimental) samples.
[2] Calculate the ratios of control versus disease. Also note that some ratios, such as 2.00, appear to be dramatic while others are not. Some researchers set a cut-off for changes of interest, such as two-fold.
[Figure labels: "A significant difference" and "Probably not".]

Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{SE} = \dfrac{\text{difference between mean values}}{\text{variability (standard error of the difference)}}$

Questions
Is the sample size (n) adequate?
Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?

Notes
• t is a ratio (it thus has no units)
• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t test, the degrees of freedom is N - 2.
• For any value of t, P gets smaller as df gets larger

Analyzing expression data
[3] Perform a t-test. The hypothesis is that the transcript in the disease group is up (or down) relative to controls.
Note the results: you can have…
a small p value (<0.05) with a big ratio difference
a small p value (<0.05) with a trivial ratio difference
a large p value (>0.05) with a big ratio difference
a large p value (>0.05) with a trivial ratio difference

Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05.
You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes:
p < (0.05)/10,000, or p < 5 × 10^-6
The Bonferroni correction is generally considered to be too conservative.

Inferential statistics: false discovery rate
The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery.
The expected number of false positives equals the p value threshold of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). The FDR is the expected proportion of false positives among the transcripts called regulated.
You can adjust the false discovery rate. For example:

FDR     # regulated transcripts   # false discoveries
0.1     100                       10
0.05    45                        3
0.01    20                        1

Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
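
The arithmetic behind these corrections can be sketched in a few lines of standard-library Python. The slides do not name a specific FDR procedure; the Benjamini-Hochberg step-up rule is used here as an assumption because it is a widely used way to control the FDR. The function names and the ten p values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag p values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def benjamini_hochberg_significant(p_values, fdr=0.05):
    """Flag p values using the Benjamini-Hochberg step-up rule.

    Assumption: BH stands in for "the FDR" of the slides, which do not
    specify a procedure.
    """
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Expected number of false positives for 10,000 genes tested at p < 0.05.
expected_false_positives = 10_000 * 0.05

# Hypothetical p values from ten t-tests (one per transcript).
p_values = [0.0001, 0.004, 0.012, 0.019, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99]
bonferroni_calls = bonferroni_significant(p_values)
bh_calls = benjamini_hochberg_significant(p_values)
print(sum(bonferroni_calls), "transcripts pass Bonferroni")
print(sum(bh_calls), "transcripts pass Benjamini-Hochberg")
```

On these toy values the Bonferroni threshold is 0.05/10 = 0.005, so fewer transcripts pass it than pass the FDR rule, which is the sense in which Bonferroni is described above as conservative.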
Inferential statistics: other methods used
• t-test for two sample groups, SAM and t-tests with permutation testing
• ANOVA for multiple factors
• Linear models with Bayesian moderation of variance
  Smyth G. (2004) “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments”
• Simultaneous inference: multivariate t-distributions for simultaneous confidence intervals
  Hsu et al. (1996) “Multiple Comparisons: Theory and Methods”
  Hsu et al. (2006) “Screening for Differential Gene Expressions from Microarray Data”

A volcano plot displays both p values and fold change
[Figure: p value (treated versus control) plotted against log fold change (treated/untreated).]

Descriptive statistics
Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples.
Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation

What is a cluster?
A cluster is a group that has homogeneity (internal cohesion) and separation (external isolation). The relationships between objects being studied are assessed by similarity or dissimilarity measures.

[Figure: Data matrix of 20 genes (rows) and 3 time points (columns, samples) from Chu et al., 1998, with a 3D plot of the genes at t=0, t=0.5 and t=2.0 (S-PLUS software).]

Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions of microarray data.
Genes may be clustered, or samples, or both.
We will next describe hierarchical clustering.
This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. (Page 355)

Distance Is Defined by a Metric
[Figure: pairs of expression profiles plotted as log2(cy5/cy3), axis from -3 to 3.]
Distance metric:   Euclidean   Pearson*
D                  1.4         -0.05
D                  6.0         +1.00

[Figure: pairs of expression profiles plotted as log2(cy5/cy3), axis from -2 to 2.]
Distance metric:   Euclidean   Pearson (r*-1)
D                  1.4         -0.90
D                  4.2         -1.00

Distance Matrix
Once a distance metric has been selected, the starting point for all clustering methods is a “distance matrix”.

        Gene1   Gene2   Gene3   Gene4   Gene5   Gene6
Gene1   0       1.5     1.2     0.25    0.75    1.4
Gene2   1.5     0       1.3     0.55    2.0     1.5
Gene3   1.2     1.3     0       1.3     0.75    0.3
Gene4   0.25    0.55    1.3     0       0.25    0.4
Gene5   0.75    2.0     0.75    0.25    0       1.2
Gene6   1.4     1.5     0.3     0.4     1.2     0

The elements of this matrix are the pair-wise distances. Note that the matrix is symmetric about the diagonal.

[Figure: Agglomerative clustering of objects a, b, c, d, e over steps 0 to 4: a and b merge into a,b; d and e merge into d,e; c joins d,e to form c,d,e; finally a,b and c,d,e merge into a,b,c,d,e, and the tree is constructed. Adapted from Kaufman and Rousseeuw (1990).]

[Figure: Divisive clustering over steps 4 to 0: starting from a,b,c,d,e, the groups c,d,e, then d,e, then a,b are split off in turn until single objects remain, and the tree is constructed. Read in one direction the diagram is agglomerative, in the other divisive. Adapted from Kaufman and Rousseeuw (1990).]

Agglomerative and divisive clustering sometimes give conflicting results, as shown here.

Agglomerative Linkage Methods
Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked.
Three linkage methods that are commonly used are:
Single Linkage
Average Linkage
Complete Linkage

Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$D_{AB} = \min_{i,j} \, d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$ for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$

Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \dfrac{1}{N_A N_B} \sum_{i} \sum_{j} d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$ for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$

Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max_{i,j} \, d(u_i, v_j)$, where $u_i \in A$ and $v_j \in B$ for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$

Comparison of Linkage Methods
[Figure: trees produced by single, average and complete linkage.]

[Figure: Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000).]

[Figure: Euclidean distance, angle distance and chord distance between points A and B, plotted on axes x1 and x2.]

K-Means/Medians Clustering
1. Specify number of clusters, e.g., 5.
2. Randomly assign genes to clusters.
[Figure: genes G1 to G13 assigned to clusters.]
3. Calculate mean/median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean expression profile (calculated in step 3) is the closest to that gene’s expression profile.
[Figure: genes G1 to G13 after reassignment.]
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
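
Steps 1 to 5 above can be sketched in standard-library Python as follows. This is a minimal illustration, not the implementation behind the slides: the function names, the squared Euclidean distance, the rule for re-seeding an empty cluster, and the toy expression profiles are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(profiles, k, max_iter=100, seed=0):
    """Cluster expression profiles following steps 1 to 5 of the slide."""
    rng = random.Random(seed)
    # Step 2: randomly assign genes to the k clusters chosen in step 1.
    labels = [rng.randrange(k) for _ in profiles]
    for _ in range(max_iter):
        # Step 3: mean expression profile of each cluster.
        centroids = []
        for cluster in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == cluster]
            if not members:
                # Assumption: an empty cluster is re-seeded with a random gene.
                members = [rng.choice(profiles)]
            centroids.append([sum(v) / len(members) for v in zip(*members)])
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Step 5: stop when no gene changes cluster (or max_iter is reached).
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical profiles: three low-expression and three high-expression genes,
# each measured in two samples.
profiles = [
    [0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
    [10.0, 10.0], [11.0, 11.0], [12.0, 12.0],
]
labels = k_means(profiles, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary (they depend on the random start), so only the grouping of genes is meaningful. Replacing the mean in step 3 with a median would give the K-Medians variant.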
k-means is most useful when the user has an a priori hypothesis about the number of clusters the genes should belong to.

K-Means / K-Medians Support (KMS)
Because of the random initialization of K-Means/K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means/K-Medians.
Run K-Means / K-Medians multiple times.
The KMS module generates clusters in which the member genes frequently group together in the same clusters (“consensus clusters”) across multiple runs of K-Means / K-Medians.
The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.

Principal components analysis (PCA)
An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D
For a matrix of m genes x n samples, create a new covariance matrix of size n x n
Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).

Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combination of variables
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g.
genes/samples)
• to identify outliers
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

High-throughput methods beyond microarrays

RNA-seq
• Sequencing technology is making fast progress
• Idea: sequencing is so cheap that we can sequence mRNA molecules directly
“Digital Gene Expression”

RNA-seq
[Figure, three panels:]
(a) After two rounds of poly(A) selection, RNA is fragmented to an average length of 200 nt by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming. The cDNA is then converted into a molecular library for Illumina/Solexa 1G sequencing, and the resulting 25-bp reads are mapped onto the genome. Normalized transcript prevalence is calculated with an algorithm from the ERANGE package.
(b) Primary data from mouse muscle RNAs that map uniquely in the genome to a 1-kb region of the Myf6 locus, including reads that span introns. The RNA-Seq graph above the gene model summarizes the quantity of reads, so that each point represents the number of reads covering each nucleotide, per million mapped reads (normalized scale of 0–5.5 reads).
(c) Detection and quantification of differential expression. Mouse poly(A)-selected RNAs from brain, liver and skeletal muscle for a 20-kb region of chromosome 10 containing Myf6 and its paralog Myf5, which are muscle specific. Myf6 is highly expressed in mature muscle, whereas Myf5 is expressed at very low levels from a small number of cells. The specificity of RNA-Seq is high: Myf6 expression is known to be highly muscle specific, and only 4 reads out of 71 million total liver and brain mapped reads were assigned to the Myf6 gene model.

Acknowledgements
• This presentation uses slides/graphics from:
J. Pevsner (Johns Hopkins, http://www.bioinfbook.org)
J.
Quackenbush (DFCI, Harvard)
C. Dewey (Wisconsin, http://www.biostat.wisc.edu/bmi576)