Gene expression:
Microarray data analysis
Compare gene expression in this cell type…
• …after viral infection
• …relative to a knockout
• …in samples from patients
• …after drug treatment
• …at a later developmental time
• …in a different body region
Gene expression is context-dependent,
and is regulated in several basic ways
• by region (e.g. brain versus kidney)
• in development (e.g. fetal versus adult tissue)
• in dynamic response to environmental signals
(e.g. immediate-early response genes)
• in disease states
• by gene activity
[Figure: DNA → RNA → protein. RNA is converted to cDNA, and
expression is measured via UniGene, SAGE, microarrays, and
next-generation sequencing.]
UniGene: unique genes via ESTs
• Find UniGene at NCBI:
www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs
• UniGene data come from many cDNA libraries.
Thus, when you look up a gene in UniGene
you get information on its abundance
and its regional distribution.
Microarrays: tools for gene expression
A microarray is a solid support (such as a membrane
or glass microscope slide) on which DNA of known
sequence is deposited in a grid-like array.
The most common form of microarray is used to measure gene
expression. RNA is isolated from matched samples of interest.
The RNA is typically converted to cDNA, labeled with
fluorescence (or radioactivity), then hybridized to microarrays
in order to measure the expression levels of thousands of genes.
Advantages of microarray experiments
• Fast: data on >20,000 transcripts in ~2 weeks
• Comprehensive: entire yeast or mouse genome on a chip
• Flexible: custom arrays can be made to represent genes of interest
• Easy: submit RNA samples to a core facility
• Cheap: chip representing 20,000 genes for $300
Disadvantages of microarray experiments
Cost
■ Some researchers can’t afford to do
appropriate numbers of controls, replicates (x100)
RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is
poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet
represented on microarrays
Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians
Sample acquisition → Data acquisition → Data analysis →
Data confirmation → Biological insight
Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays
Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Microarray databases
Stage 1: Experimental design
[1] Biological samples: technical and biological replicates:
determine the data analysis approach at the outset
[2] RNA extraction, conversion, labeling, hybridization:
except for RNA isolation, routinely performed at core facilities
[3] Arrangement of array elements on a surface:
randomization can reduce spatially-based artifacts
One sample per array
(e.g. Affymetrix or radioactivity-based platforms):
• Sample 1
• Sample 2
• Sample 3
Two samples per array (competitive hybridization):
• Samples 1,2
• Samples 1,3
• Samples 2,3
• Sample 1, pool
• Sample 2, pool
• Samples 2,1: switch dyes
Stage 2: RNA preparation
For Affymetrix chips, need total RNA (about 5 µg)
Confirm purity by running agarose gel
Measure A260/A280 to confirm purity, quantity
One of the greatest sources of error in microarray
experiments is artifacts associated with RNA isolation;
be sure to create an appropriately balanced,
randomized experimental design.
Stage 3: Hybridization to DNA arrays
The array consists of cDNA or oligonucleotides
Oligonucleotides can be deposited by photolithography
The sample is converted to cRNA or cDNA
(Note that the terms “probe” and “target” may refer to the
element immobilized on the surface of the microarray, or
to the labeled biological sample; for clarity, it may be
simplest to avoid both terms.)
Microarrays: array surface
Southern et al. (1999) Nature Genetics, microarray supplement
Stage 4: Image analysis
RNA transcript levels are quantitated
Fluorescence intensity is measured with a scanner,
or radioactivity with a phosphorimager
[Figure: differential gene expression on a cDNA microarray,
control versus Rett syndrome. αB-crystallin is over-expressed
in Rett syndrome.]
Stage 5: Microarray data analysis
Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?
Clustering
• Are there meaningful patterns in the data (e.g. groups)?
Classification
• Do RNA transcripts predict predefined groups, such as
disease subtypes?
Stage 6: Biological confirmation
Microarray experiments can be thought of as
“hypothesis-generating” experiments.
The differential up- or down-regulation of specific RNA
transcripts can be measured using independent assays
such as
-- Northern blots
-- polymerase chain reaction (RT-PCR)
-- in situ hybridization
Stage 7: Microarray databases
There are two main repositories:
• Gene Expression Omnibus (GEO) at NCBI
• ArrayExpress at the European Bioinformatics Institute (EBI)
http://www.ebi.ac.uk/arrayexpress/
MIAME
In an effort to standardize microarray data presentation
and analysis, Alvis Brazma and colleagues at 17
institutions introduced Minimum Information About a
Microarray Experiment (MIAME). The MIAME
framework standardizes six areas of information:
►experimental design
►microarray design
►sample preparation
►hybridization procedures
►image analysis
►controls for normalization
Visit http://www.mged.org
Microarray data analysis
• begin with a data matrix (gene expression values
versus samples): rows are genes (RNA transcript levels)
and columns are samples
• typically, there are many genes (>> 20,000) and
few samples (~ 10)
• the analysis involves preprocessing, inferential
statistics, and descriptive statistics
Microarray data analysis: preprocessing
Observed differences in gene expression could be
due to transcriptional changes, or they could be
caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
The main goal of data preprocessing is to remove
the systematic bias in the data as completely as
possible, while preserving the variation in gene
expression that occurs because of biologically
relevant changes in transcription.
A basic assumption of most normalization procedures
is that the average gene expression level does not
change in an experiment.
Data analysis: global normalization
Global normalization is used to correct two or more
data sets. In one common scenario, samples are
labeled with Cy3 (green dye) or Cy5 (red dye) and
hybridized to DNA elements on a microarray. After
washing, probes are excited with a laser and detected
with a scanning confocal microscope.
Example: total fluorescence in
Cy3 channel = 4 million units
Cy5 channel = 2 million units
Then the uncorrected ratio for a gene could show
2,000 units versus 1,000 units. This would artifactually
appear to show 2-fold regulation, although it only reflects
the 2-fold difference in total fluorescence between the
two channels.
Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values
(use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1
(apply this to 1-channel or 2-channel data sets)
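The two steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure of any particular software package: the function name `global_normalize`, the toy intensities, and the use of a single background value per channel are all invented for the example.

```python
def global_normalize(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Return background-subtracted Cy5/Cy3 ratios rescaled so their mean is 1.

    cy3, cy5: lists of raw intensities, one entry per array element.
    background_*: intensity of a blank region of the array (assumed constant).
    """
    # Step 1: subtract background intensity values.
    green = [g - background_cy3 for g in cy3]
    red = [r - background_cy5 for r in cy5]
    # Elements at or below background cannot give a meaningful ratio;
    # they are reported as None rather than dropped.
    ratios = [r / g if g > 0 and r > 0 else None for g, r in zip(green, red)]
    valid = [x for x in ratios if x is not None]
    # Step 2: rescale so that the average ratio equals 1.
    mean_ratio = sum(valid) / len(valid)
    return [x / mean_ratio if x is not None else None for x in ratios]


# Toy data (hypothetical): the Cy3 channel is uniformly twice as bright.
cy3 = [2100.0, 4100.0, 8100.0, 1100.0]
cy5 = [1050.0, 2050.0, 4050.0, 550.0]
normalized = global_normalize(cy3, cy5, background_cy3=100.0, background_cy5=50.0)
print(normalized)
```

In this toy case every raw ratio is 0.5 purely because of the channel imbalance, and each becomes 1 after normalization. Real pipelines often normalize log ratios rather than raw ratios; the sketch keeps raw ratios only to match the wording of the slide.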
Scatter plots
Useful to represent gene expression values from
two microarray experiments (e.g. control, experimental)
Each dot corresponds to a gene expression value
Most dots fall along a line
Outliers represent up-regulated or down-regulated genes
[Figure: scatter plots of differential gene expression in
different tissue and cell types (fibroblast, brain, astrocyte).
Axes: expression level (sample 1) versus expression level
(sample 2), from low to high, shown with a log-log
transformation.]
Scatter plots
Typically, data are plotted on log-log coordinates
Visually, this spreads out the data and offers symmetry
time    behavior       raw ratio value    log2 ratio value
t=0     basal          1.0                0.0
t=1h    no change      1.0                0.0
t=2h    2-fold up      2.0                1.0
t=3h    2-fold down    0.5                -1.0
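The log2 values in the table above follow directly from the raw ratios. A short Python check (the dictionary names are chosen here only for illustration) shows the symmetry that the log transformation provides: a 2-fold increase and a 2-fold decrease become +1 and -1.

```python
import math

# Raw ratios from the table above, keyed by time point.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# log2 maps "no change" to 0 and makes up- and down-regulation symmetric.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}

for time, value in log2_ratios.items():
    print(time, raw_ratios[time], value)
```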
[Figure: log ratio (up, down) plotted against mean log
intensity (expression level, low to high).]
You can make these plots in Excel, but for many bioinformatics
applications use R. Visit http://www.r-project.org to download it.
See chapter 9 (2nd edition) for a tutorial on microarray data analysis.
[Figure: plots of M versus A.]
After RMA (a normalization procedure), the median is
near zero, and skewing is corrected.
Scatterplots display the effects of normalization.
SNOMAD converts array data to scatter plots
http://www.snomad.org
[Figure: the same data shown as a linear-linear plot (EXP
versus CON), a log-log plot (EXP versus CON), and a plot of
Log10 ( Ratio ) versus Mean ( Log10 ( Intensity ) ), with 2-fold
lines separating EXP > CON from EXP < CON.]
SNOMAD corrects local variance artifacts
[Figure: Log10 ( Ratio ) versus Mean ( Log10 ( Intensity ) ) with a
robust local regression fit; the residuals give the corrected
Log10 ( Ratio ), plotted against Mean ( Log10 ( Intensity ) ) with
2-fold lines for EXP > CON and EXP < CON.]
SNOMAD describes regulated genes in Z-scores
[Figure: corrected Log10 ( Ratio ) versus Mean ( Log10 ( Intensity ) ),
with locally estimated standard deviations of positive and
negative ratios, 2-fold lines, and lines at Z = ±1, ±2, and ±5;
and the resulting local Log10 ( Ratio ) Z-score versus
Mean ( Log10 ( Intensity ) ).]
Robust multi-array analysis (RMA)
• Developed by Rafael Irizarry, Terry Speed, and others
• Available at www.bioconductor.org as an R package
• Also available in various software packages (including
Partek, www.partek.com and Iobion Gene Traffic)
• See Bolstad et al. (2003) Bioinformatics 19;
Irizarry et al. (2003) Biostatistics 4
There are three steps:
[1] Background adjustment based on a normal plus
exponential model (no mismatch data are used)
[2] Quantile normalization (nonparametric fitting of
signal intensity data to normalize their distribution)
[3] Fitting a log scale additive model robustly. The model
is additive: probe effect + sample effect
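Step [2] can be illustrated with a small Python sketch. This is a simplified version of quantile normalization written for this example, not the Bioconductor implementation: the function name and the toy intensities are assumptions, and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution of values.

    arrays: list of equal-length lists, one list of intensities per array.
    Returns new lists; the inputs are not modified.
    """
    n = len(arrays[0])
    # Sort each array, then average the values found at each rank across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        # order[rank] is the index of the element holding that rank in this array.
        order = sorted(range(n), key=lambda i: a[i])
        result = [0.0] * n
        for rank, index in enumerate(order):
            # Replace each value by the cross-array mean for its rank.
            result[index] = rank_means[rank]
        normalized.append(result)
    return normalized


# Toy intensities for three arrays and four elements (hypothetical values).
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
for row in normalized:
    print(row)
```

After this step each array contains exactly the same set of values, and only the assignment of values to elements differs between arrays, which is why the per-array distributions coincide after normalization.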
[Figure: histograms of raw intensity values for 14 arrays
(plotted in R) before and after RMA was applied. Axes: log
signal intensity; array (1 to 14).]
RMA adjusts for the effect of GC content
[Figure: log intensity versus GC content.]
precision: reproducibility of the result
accuracy: good quality of the result (relative to a
gold standard)
good performance: precision with accuracy
Robust multi-array analysis (RMA)
RMA offers a large increase in precision (relative to
Affymetrix MAS 5.0 software).
[Figure: precision. Log expression SD versus average log
expression, for MAS 5.0 and RMA.]
RMA offers comparable accuracy to MAS 5.0.
[Figure: accuracy. Observed log expression versus log nominal
concentration.]
Inferential statistics
Inferential statistics are used to make inferences
about a population from a sample.
Hypothesis testing is a common form of inferential
statistics. A null hypothesis is stated, such as:
“There is no difference in signal intensity for the gene
expression measurements in normal and diseased
samples.” The alternative hypothesis is that there
is a difference.
We use a test statistic to decide whether to reject the
null hypothesis. For many applications, we set the
significance level α to 0.05 and reject the null
hypothesis when p < 0.05.
Inferential statistics
A t-test is commonly used to assess the difference in
mean values between two groups. Its test statistic is

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$
Questions:
Is the sample size (n) adequate?
Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?
Notes
• t is a ratio (it thus has no units)
• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t test, the degrees of freedom (df) is N - 2,
where N is the total number of measurements in the two groups.
For any value of t, P gets smaller as df gets larger
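As a worked illustration of the ratio above, the following Python sketch computes t and the degrees of freedom for two small groups. It assumes the classic two-sample t test with a pooled variance (equal variances in the two groups), which is one of the questions raised earlier; the function name and the toy values are invented for the example. The P value would then be looked up in a table, as noted above.

```python
import math
from statistics import mean, variance


def two_sample_t(group1, group2):
    """Return (t, df) for an unpaired t test with pooled variance."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2  # degrees of freedom: N - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    # Standard error of the difference between the two means.
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / se
    return t, df


# Hypothetical expression values for one transcript.
control = [1.0, 2.0, 3.0]
disease = [2.0, 3.0, 4.0]
t, df = two_sample_t(disease, control)
print(round(t, 4), df)
```

With three samples per group the means differ by 1 unit, yet t is only about 1.22 on 4 degrees of freedom, which is a reminder of how little power very small sample sizes provide.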
Analyzing expression data in Excel
Question: for each of my 20,000 transcripts, decide whether
it is significantly regulated in some disease.
control
disease
[1] Obtain a matrix of genes (rows) and expression values
(columns). Here there are 20,000 rows of genes, of which the
first six are shown. There are three control samples and
three disease samples. You can also calculate the mean
value for each gene (transcript) for the controls and the
disease (experimental) samples.
Analyzing expression data in Excel
[2] You can calculate the ratios of control versus disease.
Note that you can use the formula =E5/I5 in this case to
divide the mean control and disease values.
Also note that some ratios, such as 2.00, appear to be
dramatic while others are not. Some researchers set a cutoff for changes of interest such as two-fold.
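The same ratio calculation can be sketched outside a spreadsheet. The Python below is only an illustration: the gene names and expression values are made up, and the helper names are hypothetical. It divides the mean control value by the mean disease value, as the formula above does, and flags changes of at least two-fold in either direction.

```python
from statistics import mean


def control_to_disease_ratio(control_values, disease_values):
    """Ratio of the mean control value to the mean disease value."""
    return mean(control_values) / mean(disease_values)


def exceeds_fold_cutoff(ratio, cutoff=2.0):
    """True if the ratio is at least cutoff-fold up or cutoff-fold down."""
    return ratio >= cutoff or ratio <= 1.0 / cutoff


# Hypothetical data: three control and three disease values per gene.
genes = {
    "gene_A": ([200.0, 220.0, 180.0], [100.0, 90.0, 110.0]),
    "gene_B": ([100.0, 105.0, 95.0], [98.0, 102.0, 100.0]),
}

ratios = {name: control_to_disease_ratio(c, d) for name, (c, d) in genes.items()}
flags = {name: exceeds_fold_cutoff(r) for name, r in ratios.items()}
print(ratios)
print(flags)
```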
Analyzing expression data in Excel
[3] Perform a t-test. When you enter =TTEST into the
function box above, a dialog box appears. Enter the range of
values for controls and for disease samples, and specify a
1- or 2-tailed test.
For a one-tailed test, your prior hypothesis is that the
transcript in the disease group is up (or down) relative to
controls; the change is unidirectional. For example, in Down
syndrome samples you might hypothesize that chromosome 21
transcripts are significantly up-regulated because of the
extra copy of chromosome 21.
For a two-tailed test, you hypothesize that the two groups
are different, but you do not know in which direction.
Analyzing expression data in Excel
[3] Note the results: you can have…
a small p value (<0.05) with a big ratio difference
a small p value (<0.05) with a trivial ratio difference
a large p value (>0.05) with a big ratio difference
a large p value (>0.05) with a trivial ratio difference
Only the first group is worth reporting! Why?
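The four outcomes listed above can be made explicit with a small Python sketch. The p values and ratios below are invented, the function name is hypothetical, and the thresholds (p < 0.05 and a two-fold ratio cutoff) are simply the ones mentioned earlier in these slides rather than universal rules.

```python
def describe_result(p_value, ratio, alpha=0.05, fold_cutoff=2.0):
    """Return (description, worth_reporting) for one transcript."""
    small_p = p_value < alpha
    big_ratio = ratio >= fold_cutoff or ratio <= 1.0 / fold_cutoff
    description = "a {} p value with a {} ratio difference".format(
        "small" if small_p else "large",
        "big" if big_ratio else "trivial",
    )
    # Only a change that is both statistically significant and large is kept.
    return description, small_p and big_ratio


# One hypothetical transcript for each of the four outcomes.
examples = [(0.01, 2.5), (0.01, 1.1), (0.40, 2.5), (0.40, 1.1)]
results = [describe_result(p, r) for p, r in examples]
for description, worth_reporting in results:
    print(description, worth_reporting)
```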
t-test to determine statistical significance
[Figure: variability divided into disease vs normal and error.]

$$ t\ \text{statistic} = \frac{\text{difference between mean of disease and normal}}{\text{variation due to error}} $$
Inferential statistics
Paradigm                       Parametric test    Nonparametric
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA
ANOVA partitions total data variability
[Figure: before partitioning, variability is divided into
disease vs normal and error; after partitioning, into subject,
tissue type, disease vs normal, and error.]

$$ F\ \text{ratio} = \frac{\text{variation between DS and normal}}{\text{variation due to error}} $$
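A minimal Python sketch of the F ratio follows. It assumes the simplest case, a one-way ANOVA with a single factor (DS versus normal), so it does not partition out subject or tissue type as the figure does; the function name and the toy values are assumptions made for the example.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F ratio: between-group variation over error variation."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of measurements
    # Variation between groups (e.g. DS versus normal).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within groups, attributed to error.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical expression values for one transcript.
normal = [1.0, 2.0, 3.0]
ds = [2.0, 3.0, 4.0]
f = f_ratio([normal, ds])
print(f)
```

With only two groups, F equals the square of the pooled-variance t statistic for the same data, which is one way to see that ANOVA generalizes the t-test to three or more groups.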
Descriptive statistics
Microarray data are high-dimensional: there are
many thousands of measurements made from a small
number of samples.
Descriptive (exploratory) statistics help you to find
meaningful patterns in the data.
A first step is to arrange the data in a matrix.
Next, use a distance metric to define the relatedness
of the different data points. Two commonly used
distance metrics are:
-- Euclidean distance
-- Pearson correlation coefficient
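The two metrics can behave quite differently, as a short Python sketch shows. The function names and the two toy expression profiles are assumptions made for this illustration. A common convention, assumed here, is to turn the correlation r into a distance as 1 - r.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson correlation coefficient r between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Two hypothetical genes with the same pattern but different magnitudes.
gene1 = [1.0, 2.0, 3.0]
gene2 = [2.0, 4.0, 6.0]

distance = euclidean_distance(gene1, gene2)
r = pearson_correlation(gene1, gene2)
print(distance, r, 1 - r)
```

Here the Euclidean distance is well above zero because the magnitudes differ, while the correlation is 1 because the shapes of the two profiles are identical. Which metric is appropriate depends on whether absolute level or pattern matters for the question being asked.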
What is a cluster?
A cluster is a group that has homogeneity (internal
cohesion) and separation (external isolation). The
relationships between objects being studied are
assessed by similarity or dissimilarity measures.
[Figure: data matrix of genes (rows) by samples (time points;
columns): 20 genes and 3 time points (t=0, t=0.5, t=2.0) from
Chu et al., 1998, shown with a 3D plot made using S-PLUS
software.]
Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions
of microarray data.
Genes may be clustered, or samples, or both.
We will next describe hierarchical clustering.
This may be agglomerative (building up the branches
of a tree, beginning with the two most closely related
objects) or divisive (building the tree by finding the
most dissimilar objects first).
In each case, we end up with a tree having branches
and nodes.
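Agglomerative clustering can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the algorithm of any particular program: it uses Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), and the five one-dimensional objects a to e are made-up values.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def single_linkage(points, cluster1, cluster2):
    """Distance between the closest pair of members of two clusters."""
    return min(euclidean(points[i], points[j]) for i in cluster1 for j in cluster2)


def agglomerative(points):
    """points: dict mapping object name to a tuple of coordinates.

    Returns the merges in order, as (members1, members2, distance) tuples.
    """
    clusters = [frozenset([name]) for name in points]  # each object starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two most closely related clusters and join them.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(points, pair[0], pair[1]),
        )
        merges.append((sorted(c1), sorted(c2), single_linkage(points, c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Hypothetical one-dimensional data for five objects.
toy = {"a": (0.0,), "b": (1.0,), "c": (6.0,), "d": (9.0,), "e": (10.5,)}
merges = agglomerative(toy)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Each merge corresponds to one node of the tree, and the distance at which it occurs gives the height of that node. Other linkage rules (complete or average linkage, for example) can produce different trees from the same data.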
Agglomerative clustering
[Figure, built up over steps 0 to 4: objects a, b, c, d, and e
start as separate clusters; a and b merge into (a,b); d and e
merge into (d,e); c joins (d,e) to form (c,d,e); finally (a,b)
and (c,d,e) merge into (a,b,c,d,e), and the tree is constructed.
Adapted from Kaufman and Rousseeuw (1990).]
Divisive clustering
[Figure, built up over steps 4 to 0: all objects start in one
cluster (a,b,c,d,e), which is divided into (a,b) and (c,d,e);
(c,d,e) is divided into c and (d,e); the remaining groups are
divided until a, b, c, d, and e stand alone, and the tree is
constructed.]
[Figure: the same tree of objects a to e read in both
directions: agglomerative (steps 0 to 4) and divisive (steps
4 to 0). Adapted from Kaufman and Rousseeuw (1990).]
Agglomerative and divisive clustering sometimes give
conflicting results, as shown here.
[Figure: trees from agglomerative and divisive clustering of
the same data.]
Cluster and TreeView
[Figure: Cluster and TreeView screenshot; labeled options:
clustering, K means, SOM, PCA.]
[Figure: two-way clustering of genes (y-axis) and cell lines
(x-axis) (Alizadeh et al., 2000).]
Principal components analysis (PCA)
An exploratory technique used to reduce the
dimensionality of the data set to 2D or 3D
For a matrix of m genes x n samples, create a new
covariance matrix of size n x n
Thus transform some large number of variables into
a smaller number of uncorrelated variables called
principal components (PCs).
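The idea can be sketched in plain Python for a very small matrix. This is an illustration under assumptions, not a production method: the function names are invented, the data are toy values, and the first principal component is found by power iteration, which can fail if the starting vector happens to be orthogonal to that component or if the data have no variance. Statistical packages use more robust decompositions.

```python
import math


def covariance_matrix(matrix):
    """matrix: m rows (genes), each a list of n sample values.

    Returns the n x n covariance matrix of the samples.
    """
    m, n = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / m for j in range(n)]
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in matrix) / (m - 1)
            for k in range(n)
        ]
        for j in range(n)
    ]


def first_principal_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    n = len(cov)
    v = [1.0 / (k + 1) for k in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy matrix: 4 genes x 2 samples, with sample 2 always twice sample 1.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(data)
pc1 = first_principal_component(cov)
print(cov)
print(pc1)
```

Because the two samples in the toy matrix are perfectly correlated, a single component pointing along the direction (1, 2) captures all of the variation, so two variables reduce to one.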
Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combination of variables
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g. genes/samples)
• to identify outliers
http://www.okstate.edu/artsci/botany/ordinate/PCA.htm
Copyright notice
Many of the images in this powerpoint presentation
are from Bioinformatics and Functional Genomics
by Jonathan Pevsner (ISBN 0-471-21004-8).
Copyright © 2003 by John Wiley & Sons, Inc.