Introduction to analysis of microarray data
David Edwards
Aarhus Universitet, Faculty of Agricultural Sciences

The Microarray Study Process

Study Objectives
  Class comparison: differential expression
  Class prediction: classification
  Class discovery: clustering

Differential Expression
How do we identify genes whose expression level changes across the conditions in the study?

Analysis Strategy
The study may aim to:
  Compare two groups (e.g. treatment vs control)
  Compare more than two groups
  Make more than one comparison (e.g. 2 treatments at 3 timepoints)
As a first approximation, we can think of our approach as:
  1. Choose the appropriate analysis method for a single gene.
  2. Apply it to all genes, correcting for multiplicity (e.g. FDR).

Additive and multiplicative scales
Most statistical models use additive scales and assume constant variance. Gene expression appears to work more on a multiplicative scale (fold changes rather than expression differences), and the variance in gene expression depends on its absolute value.
Conclusion: transform the data by taking logarithms (conventionally base 2).

Fold Change & Log Ratios
We have transformed our data by taking logarithms, so differences are log-ratios (log fold changes):
  $\log(a/b) = \log(a) - \log(b)$
With two-channel (cDNA) data, the numbers we analyze are (usually) the within-spot log-ratios:
  $M = \log(R) - \log(G)$
To estimate the log fold change across replicate slides, we compute the average log-ratio across the replicates.
With one-channel (Affy) data, the numbers we analyze are the logs of the expression measures (e.g. RMA). To estimate the log fold change between two groups of arrays, we compute the average log-expression within each group and take the difference:
  $\mathrm{LR} = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_{1i} - \frac{1}{n_2}\sum_{i=1}^{n_2} Y_{2i}$

Analysis
The chosen method is applied to one gene at a time: first gene 1, then gene 2, ... then gene 20000.

Some examples of methods
  Two-sample t-test
  Linear regression: $y_t = y_0 + \beta Z$, where $y_0$ is the baseline expression (before treatment), $Z$ is the group indicator (0 = control, 1 = treatment) and $\beta$ is the group effect
  ANOVA models
  Non-parametric tests
  ...
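The log fold change and the two-sample t-test described above can be sketched for a couple of genes as follows. This is a minimal illustration using only the Python standard library, not the author's code: the gene names and expression values are made up, and the helper names log2_fold_change and two_sample_t are hypothetical. It computes the classical pooled-variance t statistic only; turning it into a p-value would need a t-distribution function, which is left out here.

```python
import math
from statistics import mean, variance


def log2_fold_change(group1, group2):
    """Difference of mean log2 expression between two groups of arrays (LR)."""
    return mean(group1) - mean(group2)


def two_sample_t(group1, group2):
    """Classical pooled-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / std_err, n1 + n2 - 2


# Hypothetical raw expression measures: gene -> (treatment arrays, control arrays).
raw = {
    "gene_A": ([1600.0, 2048.0, 1800.0], [400.0, 512.0, 450.0]),
    "gene_B": ([510.0, 480.0, 530.0], [500.0, 520.0, 490.0]),
}

results = {}
for gene, (treated, control) in raw.items():
    # Work on the log2 scale, so that differences are log fold changes.
    y1 = [math.log2(x) for x in treated]
    y2 = [math.log2(x) for x in control]
    lfc = log2_fold_change(y1, y2)
    t_stat, df = two_sample_t(y1, y2)
    results[gene] = (lfc, t_stat, df)
    print(f"{gene}: log2 FC = {lfc:.2f}, fold change = {2 ** lfc:.2f}, "
          f"t = {t_stat:.2f} on {df} df")
```

In a real study the same loop would run over all genes (e.g. 20000), producing one test statistic per gene.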
Multiplicity
Typically a list of p-values is obtained, one per gene. Now we need to select the genes likely to be differentially expressed.
If we used p < 0.05 as the criterion, about 1000 (= 0.05 x 20000) genes would be selected on average even if there were no differential expression at all.
If we select genes using the criterion $p < \alpha/N$, where N is the total number of genes (Bonferroni's correction), this controls the familywise error rate:
  $\Pr(\text{any type I error}) = \Pr(\text{any false selections}) \le \alpha$
But this is usually too stringent.

False Discovery Rate
FDR = the (expected) proportion of false positives among the selected genes.
Two uses:
  If the top 100 genes are selected for further study, what proportion may be expected to be false positives?
  If we want a proportion of 5% false positives, how many genes should be selected?
Adjusted p-values (q-values) can be defined such that selecting the genes with $q_g < \alpha$ results in $\mathrm{FDR} < \alpha$.

LIMMA Package: Linear Models for Microarray Data
  Analysis of differential expression studies
  Arbitrarily complex experiments: linear models, contrasts
  Empirical Bayes methods for differential expression: t-tests, F-tests, posterior odds
  Inference methods for duplicate spots, technical replication
  Analyse log-ratios or log-intensities
  Spot quality weights
  Control of FDR across genes and contrasts
  Stemmed heat diagrams, Venn diagrams
  Pre-processing: background correction, within- and between-array normalization

Empirical Bayes Methods in Limma
Problem with ordinary t-tests here: small estimates of the S.D. can arise by chance, giving false positives.
Limma uses an empirical Bayes approach: the gene variances are given a prior distribution (the sample distribution of the variances across genes). Each variance is then updated using the data to obtain a posterior distribution, and an estimate is derived from the posterior distribution. This shrinks the variances towards the prior mean.
This estimate is then substituted into the classical t-statistic (the "degrees of freedom" are adjusted), giving the so-called moderated t-test.
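The variance shrinkage behind the moderated t-test can be sketched as below. This is a minimal illustration, assuming the posterior variance form given in Smyth (2004), $\tilde{s}_g^2 = (d_0 s_0^2 + d_g s_g^2)/(d_0 + d_g)$, with $d_0 + d_g$ degrees of freedom. The prior values prior_s2 and prior_df are hypothetical numbers chosen for the example; Limma estimates them from the variances of all genes, which this sketch does not attempt. The function name moderated_t and the example gene are also made up.

```python
import math


def moderated_t(effect, s2, df_resid, unscaled_var, prior_s2, prior_df):
    """Moderated t statistic using a variance shrunk towards a prior value.

    effect        estimated log fold change for the gene
    s2            the gene's own residual variance estimate
    df_resid      residual degrees of freedom of s2
    unscaled_var  variance multiplier of the effect (1/n1 + 1/n2 for two groups)
    prior_s2      prior variance (assumed known here, hypothetical value)
    prior_df      prior degrees of freedom (assumed known here, hypothetical value)
    """
    # Weighted average of the prior variance and the gene-wise variance.
    post_s2 = (prior_df * prior_s2 + df_resid * s2) / (prior_df + df_resid)
    t_mod = effect / math.sqrt(post_s2 * unscaled_var)
    return t_mod, prior_df + df_resid, post_s2


# Hypothetical gene: small effect, but a very small variance estimate by chance.
effect = 0.2
s2 = 0.0004
df_resid = 4                  # two groups of three arrays: 3 + 3 - 2
unscaled_var = 1 / 3 + 1 / 3
prior_s2, prior_df = 0.05, 4.0

t_ordinary = effect / math.sqrt(s2 * unscaled_var)
t_mod, df_total, post_s2 = moderated_t(effect, s2, df_resid, unscaled_var,
                                       prior_s2, prior_df)
print(f"ordinary t  = {t_ordinary:.2f} on {df_resid} df")
print(f"moderated t = {t_mod:.2f} on {df_total:.0f} df (shrunk variance {post_s2:.4f})")
```

The gene's tiny variance estimate is pulled towards the prior value, so the moderated statistic is much less extreme than the ordinary one, which is the protection against chance false positives described above.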
There is good evidence that the moderated t-test is more robust than the classical approach.
Given a prior estimate p of the proportion of DE (differentially expressed) genes, the posterior probability $p_g$ that a gene g is DE can be calculated. The B-statistic given by Limma is the log-odds, i.e. $B_g = \log(O_g)$ with $O_g = p_g/(1 - p_g)$. This is useful for ranking genes.

Smyth, GK (2004). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Stat. Appl. in Genetics and Mol. Biol., 3, 1.
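Returning to the multiplicity slides, the two selection rules (Bonferroni and FDR-adjusted p-values) can be sketched as follows. The slides do not say which procedure defines the adjusted p-values; this sketch assumes the Benjamini-Hochberg step-up adjustment, one common choice. The function names and the list of p-values are hypothetical.

```python
def bonferroni_select(pvalues, alpha=0.05):
    """Indices of genes with p < alpha / N (controls the familywise error rate)."""
    n = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p < alpha / n]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original gene order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per gene.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
alpha = 0.05

selected_bonferroni = bonferroni_select(pvalues, alpha)
adjusted = bh_adjust(pvalues)
selected_fdr = [i for i, q in enumerate(adjusted) if q < alpha]

print("Bonferroni selects genes:", selected_bonferroni)
print("FDR (adjusted p < alpha) selects genes:", selected_fdr)
```

With these made-up values the FDR criterion selects more genes than Bonferroni, which illustrates why Bonferroni is described as usually too stringent.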