Introduction to analysis of microarray data
David Edwards
AARHUS UNIVERSITET
Faculty of Agricultural Sciences
The Microarray Study Process
Study Objectives
• Class comparison: differential expression
• Class prediction: classification
• Class discovery: clustering
Differential Expression
How can we identify genes whose expression level changes across the conditions in the study?
Analysis Strategy
The aim of the study may be to:
• Compare two groups (e.g. treatment vs control)
• Compare more than two groups
• Make more than one comparison (e.g. 2 treatments at 3 timepoints)
As a first approximation, we can think of our approach as:
1. Choose the appropriate analysis method for a single gene.
2. Apply it to all genes, correcting for multiplicity (e.g. by controlling the FDR).
Additive and multiplicative scales
• Most statistical models assume an additive scale and constant variance.
• Gene expression appears to work more on a multiplicative scale (fold changes rather than expression differences), and the variance of gene expression depends on its absolute level.
• Conclusion: transform the data by taking logarithms (conventionally base 2).
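A minimal sketch of why the log scale suits fold changes. The intensities are hypothetical values chosen for illustration. A 2-fold increase and a 2-fold decrease give unequal differences on the raw scale but symmetric values on the log2 scale.

```python
import math

# Hypothetical raw intensities for one gene (illustrative values only).
control = 200.0
doubled = 400.0   # 2-fold up relative to control
halved = 100.0    # 2-fold down relative to control

def log2_ratio(a, b):
    """Return log2(a / b), computed as a difference of logs."""
    return math.log2(a) - math.log2(b)

# Raw (additive) scale: the two changes look unequal in size.
raw_up = doubled - control     # 200.0
raw_down = halved - control    # -100.0

# Log2 scale: the two changes are symmetric around zero.
log_up = log2_ratio(doubled, control)
log_down = log2_ratio(halved, control)

print(raw_up, raw_down)
print(log_up, log_down)
```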
Fold Change & Log Ratios
We have transformed our data by taking logarithms, so differences are log-ratios (log fold changes):
$\log(a/b) = \log(a) - \log(b)$
With two-channel (cDNA) data the numbers we analyze are (usually) the within-spot log-ratios of the red and green channel intensities:
$M = \log(R) - \log(G)$
To estimate the log fold change across replicate slides we compute the average log-ratio across the replicates.
With one-channel (Affymetrix) data the numbers we analyze are the logs of the expression measures (e.g. RMA).
To estimate the log fold change between two groups of arrays we compute the average log-expression within each group and take the difference:
$LR = \frac{1}{n_1}\sum_{i=1}^{n_1} Y_{1i} - \frac{1}{n_2}\sum_{i=1}^{n_2} Y_{2i}$
where $Y_{ki}$ is the log-expression on array $i$ of group $k$ and $n_k$ is the number of arrays in group $k$.
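The two estimates above can be sketched as follows. All numbers are hypothetical, and the variable names are chosen here for illustration. Base 2 is assumed for the logarithm.

```python
import math
from statistics import mean

# Two-channel data: hypothetical red/green intensities for one spot
# on three replicate slides.
red = [1200.0, 900.0, 1500.0]
green = [600.0, 500.0, 700.0]

# Within-spot log-ratios M = log2(R) - log2(G), one per slide.
m_values = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]

# Estimated log fold change: average log-ratio across replicates.
two_channel_lfc = mean(m_values)

# One-channel data: hypothetical log2 expression measures for one gene,
# three arrays per group.
group1 = [8.1, 7.9, 8.3]
group2 = [7.0, 7.2, 6.9]

# Estimated log fold change: difference of the group averages.
one_channel_lfc = mean(group1) - mean(group2)

print(two_channel_lfc, one_channel_lfc)
```

Raising 2 to the power of either estimate converts it back to a fold change on the original scale.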
Analysis
The chosen method is applied to gene 1, then to gene 2, ... then to gene 20000.
Some examples of methods
• Two-sample t-test
• Linear regression: $y_t = y_0 + \beta Z$
  ◦ $y_0$: baseline expression (before treatment)
  ◦ $Z$: group indicator (0 = control, 1 = treatment)
  ◦ $\beta$: group effect
• ANOVA models
• Non-parametric tests
• ...
Multiplicity
• Typically a list of p-values is obtained, one per gene.
• Now we need to select the genes likely to be differentially expressed.
• If we used p < 0.05 as the criterion, we would expect about 1000 (= 0.05 × 20000) genes to be selected even if no gene were differentially expressed.
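The arithmetic above can be checked with a small simulation. This sketch assumes that, when no gene is differentially expressed, the p-values are independent and uniformly distributed on [0, 1]. The seed is arbitrary.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility

n_genes = 20000
alpha = 0.05

# Under the null hypothesis, each p-value is uniform on [0, 1].
null_p_values = [random.random() for _ in range(n_genes)]

# Genes "selected" by the naive criterion p < alpha, all false positives.
n_selected = sum(p < alpha for p in null_p_values)

# Expected number of false selections.
expected = alpha * n_genes

print(n_selected, expected)
```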
Multiplicity
• If we select genes using the criterion $p < \alpha/N$, where $N$ is the total number of genes (Bonferroni's correction), this controls the familywise error rate:
Pr(any type I error) = Pr(any false selections) $\le \alpha$
• But this is usually too stringent.
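A short sketch of Bonferroni selection. The p-values and the function name `bonferroni_select` are hypothetical. The last line shows how small the per-gene threshold becomes with 20000 genes, which is why the correction is often too stringent.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return indices of genes with p < alpha / N (Bonferroni's correction)."""
    n_tests = len(p_values)
    threshold = alpha / n_tests
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values for five genes; the threshold is 0.05 / 5 = 0.01.
p_values = [1e-7, 0.0004, 0.02, 0.3, 0.8]
selected = bonferroni_select(p_values)

# With 20000 genes the per-gene threshold is 2.5e-6.
threshold_20000 = 0.05 / 20000

print(selected, threshold_20000)
```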
False Discovery Rate
• FDR = expected proportion of false positives among the selected genes.
• Two uses:
  ◦ If the top 100 genes are selected for further study, what proportion may be expected to be false positives?
  ◦ If we want a proportion of 5% false positives, how many genes should be selected?
• Adjusted p-values (q-values) can be defined such that selecting genes with $q_g < \alpha$ results in FDR $\le \alpha$.
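One common way to define such adjusted p-values is the Benjamini-Hochberg procedure, sketched below. This is an assumption about which adjustment is meant: the slides do not name a method, and other q-value definitions give different numbers. The p-values and the function name `bh_adjust` are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for five genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
q = bh_adjust(p_values)

# Select genes with adjusted p-value below the target FDR level.
alpha = 0.05
selected = [i for i, q_g in enumerate(q) if q_g < alpha]

print(q, selected)
```

Here the third and fourth genes have raw p-values below 0.05 but adjusted values slightly above it, so only the first two genes are selected.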
LIMMA Package: Linear Models for Microarray Data
Analysis of differential expression studies
• arbitrarily complex experiments: linear models, contrasts
• empirical Bayes methods for differential expression: t-tests, F-tests, posterior odds
• inference methods for duplicate spots, technical replication
• analyse log-ratios or log-intensities
• spot quality weights
• control of FDR across genes and contrasts
• stemmed heat diagrams, Venn diagrams
• pre-processing: background correction, within- and between-array normalization
Empirical Bayes Methods in Limma
• Problem with ordinary t-tests here: small estimates of the S.D. can arise by chance, giving false positives.
• Limma uses an empirical Bayes approach:
  ◦ The gene variances are given a prior distribution (the sample distribution). Each variance is then updated using the data to obtain a posterior distribution, and an estimate is derived from the posterior distribution.
  ◦ This shrinks the variances towards the prior mean. This estimate is then substituted into the classical t-statistic (with the "degrees of freedom" adjusted), giving the so-called moderated t-test.
  ◦ There is good evidence that this is more robust than the classical approach.
  ◦ Given a prior estimate $p$ of the proportion of differentially expressed (DE) genes, the posterior probability $p_g$ that gene $g$ is DE can be calculated. The B-statistic given by Limma is the log-odds, i.e. $\log(O_g)$ with $O_g = p_g/(1 - p_g)$. This is useful for ranking genes.
  ◦ Smyth, GK (2004). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, 3(1), Article 3.
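A rough sketch of the variance shrinkage and the log-odds conversion, following the moderated t-statistic in Smyth (2004). This is not Limma's implementation. Limma estimates the prior variance and prior degrees of freedom from all genes, whereas here they are fixed hypothetical values, as are the gene-level numbers and the function names. The natural logarithm is assumed for the log-odds.

```python
from math import exp, log, sqrt

def moderated_t(effect, s2, df, unscaled_var, prior_s2, prior_df):
    """Moderated t-statistic and its degrees of freedom.

    The sample variance s2 (df degrees of freedom) is shrunk towards
    prior_s2 (prior_df degrees of freedom) before forming the statistic.
    """
    posterior_s2 = (prior_df * prior_s2 + df * s2) / (prior_df + df)
    t = effect / sqrt(posterior_s2 * unscaled_var)
    return t, prior_df + df

def prob_from_log_odds(b):
    """Posterior probability p_g from log-odds B = log(p_g / (1 - p_g))."""
    return 1.0 / (1.0 + exp(-b))

# Hypothetical gene: modest effect, very small sample variance by chance.
effect, s2, df = 0.3, 0.0004, 4
unscaled_var = 1 / 3 + 1 / 3      # two groups of three arrays

# Ordinary t-statistic: inflated by the tiny variance estimate.
ordinary_t = effect / sqrt(s2 * unscaled_var)

# Moderated t-statistic with assumed (not estimated) prior values.
mod_t, mod_df = moderated_t(effect, s2, df, unscaled_var,
                            prior_s2=0.05, prior_df=4.0)

# A log-odds of 0 corresponds to a posterior probability of 0.5.
p_half = prob_from_log_odds(0.0)

print(ordinary_t, mod_t, mod_df, p_half)
```

With these numbers the moderated statistic is far smaller than the ordinary one, because the implausibly small variance estimate has been pulled towards the prior value.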