Microarray Data Analysis
Previous Lecture: Proteomics Informatics
Example data – MALDI-TOF
[Figure: MALDI-TOF mass spectra, peptide intensity vs. m/z (data file: D:\Users\Fenyo\Desktop\ATP.txt; description: none available)]
This Lecture
Gene Expression Analysis (I)
Learning Objectives
• Microarray experimental details
• Microarray data formats
• QC analysis and data exploration
• Normalization
• Differential expression
• Functional enrichment
• Databases
The Central Dogma of Molecular Biology
DNA is transcribed into RNA which is then
translated into protein
[Diagram: DNA (replication) → transcription → RNA (measured by microarray) → translation → protein]
What is a Microarray
• A simple concept: Dot Blot + Northern
• Reverse the hybridization - put the probes
on the filter and label the bulk RNA
• Make probes for lots of genes - a massively
parallel experiment
• Make it tiny so you don’t need so much
RNA from your experimental cells.
• Make quantitative measurements
Microarrays are Popular
• At NYU Med Center we are now collecting about 3 GB of microarray data per week (60 chips, 6-10 different experiments)
• PubMed search "microarray" = 13,948 papers
– 2005 = 4406
– 2004 = 3509
– 2003 = 2421
– 2002 = 1557
– 2001 = 834
– 2000 = 294
[Figure: bar chart of PubMed "microarray" papers per year, 2000 to 2005]
A Filter Array
DNA Chip Microarrays
• Put a large number (~100K) of cDNA sequences or
synthetic DNA oligomers onto a glass slide (or other
substrate) in known locations on a grid.
• Label an RNA sample and hybridize
• Measure amounts of RNA bound to each square in
the grid
• Make comparisons
– Cancerous vs. normal tissue
– Treated vs. untreated
– Time course
• Many applications in both basic and clinical research
cDNA Microarray Technologies
• Spot cloned cDNAs onto a glass microscope
slide
– usually PCR amplified segments of plasmids
• Label 2 RNA samples with 2 different colors
of fluorescent dye - control vs. experimental
• Mix two labeled RNAs and hybridize to the
chip
• Make two scans - one for each color
• Combine the images to calculate ratios of
amounts of each RNA that bind to each spot
Spot your own Chip
(plans available for free from Pat Brown’s website)
Robot spotter
Ordinary glass
microscope slide
Combine scans for Red & Green
False color image is made from digitized fluorescence data,
not by superimposing scanned images
cDNA Spotted Microarrays
Data Acquisition
• Scan the arrays
• Quantitate each spot
• Subtract background
• Normalize
• Export a table of fluorescent intensities for each gene in the array
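As a rough illustration of the quantitation steps listed above, the sketch below subtracts a local background from each channel of one spot and reports the log2 red/green ratio. The function name, the floor value, and the example numbers are hypothetical; real image-analysis software uses more elaborate background models.

```python
import math

def spot_log_ratio(red_fg, red_bg, green_fg, green_bg, floor=1.0):
    """Background-subtract both channels of one spot and return log2(red/green).

    `floor` is an assumed value that keeps the corrected signal positive,
    so the logarithm stays defined for spots at or below background.
    """
    red = max(red_fg - red_bg, floor)
    green = max(green_fg - green_bg, floor)
    return math.log2(red / green)

# Made-up spot: the red channel is 4x the green channel after correction
print(spot_log_ratio(red_fg=900, red_bg=100, green_fg=300, green_bg=100))  # 2.0
```

A log2 ratio of 0 means equal signal in both channels; +1 and -1 mean a 2-fold difference in either direction.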
Affymetrix “Gene chip” system
• Uses 25 base oligos synthesized in place on a
chip (20 pairs of oligos for each gene)
• RNA labeled and scanned in a single “color”
– one sample per chip
• Can have as many as 20,000 genes on a chip
• Arrays get smaller every year (more genes)
• Chips are expensive
• Proprietary system: “black box” software, can only use their chips
Affymetrix Gene Chip
Affymetrix Technology
Affymetrix Software
• Affymetrix System is totally automated
• Computes a single value for each gene from 40
probes - (using surprisingly kludgy math)
• Highly reproducible
(re-scan of same chip or hyb. of duplicate chips with
same labeled sample gives very similar results)
• Incorporates false results due to image artefacts
– dust, bubbles
– pixel spillover from bright spot to neighboring dark
spots
Affymetrix Pivot Table
Each sample column gives VALUE followed by ABS_CALL (P = present, M = marginal, A = absent).
ID_REF | normal | tumor | tumor | normal | normal | tumor
AFFX-BioB-5_at | 210.6 P | 234.6 P | 362.5 P | 389 P | 305.6 P | 330.5 P
AFFX-BioB-M_at | 393 P | 327.8 P | 501.4 P | 816.5 P | 542 P | 440.8 P
AFFX-BioB-3_at | 264.9 P | 164.6 P | 244.7 P | 379.7 P | 261.3 P | 303.7 P
AFFX-BioC-5_at | 738.6 P | 676.1 P | 737.6 P | 1191.2 P | 917 P | 767.9 P
AFFX-BioC-3_at | 356.3 P | 365.9 P | 423.4 P | 711.6 P | 560.3 P | 484.9 P
AFFX-BioDn-5_at | 566.3 P | 442.2 P | 649.7 P | 834.3 P | 599.1 P | 606.9 P
AFFX-BioDn-3_at | 3911.8 P | 3703.7 P | 4680.9 P | 6037.7 P | 4653.7 P | 4232 P
AFFX-CreX-5_at | 6433.3 P | 5980 P | 7734.7 P | 10591 P | 8162.1 P | 8428 P
AFFX-CreX-3_at | 11917.8 P | 9376.7 P | 11509.3 P | 16814.4 P | 13861.8 P | 13653.4 P
AFFX-DapX-5_at | 12.2 A | 44.3 M | 31.2 A | 37.7 P | 33.3 A | 12.8 A
AFFX-DapX-M_at | 57.8 M | 42.5 A | 79 M | 48.8 P | 39.5 A | 39.2 A
AFFX-DapX-3_at | 29.8 A | 6.2 A | 23.4 A | 28.4 A | 3.2 A | 7.6 A
AFFX-LysX-5_at | 15.3 A | 16.2 A | 15.6 A | 16.7 A | 3.1 A | 3.9 A
AFFX-LysX-M_at | 33.2 A | 12 A | 17.7 A | 37.3 A | 49.2 A | 9.1 A
AFFX-LysX-3_at | 40.7 M | 10.7 A | 36.2 A | 22.1 A | 22.8 A | 28.2 A
AFFX-PheX-5_at | 7.8 A | 3 A | 7.6 A | 5.6 A | 5 A | 6.4 A
AFFX-PheX-M_at | 4.2 A | 4.8 A | 6.8 A | 6.1 A | 3.7 A | 5.5 A
AFFX-PheX-3_at | 54.2 A | 39.6 A | 19.4 A | 16.1 A | 44.7 A | 31.2 A
AFFX-ThrX-5_at | 8.2 A | 11.2 A | 13.2 A | 9.5 A | 8.5 A | 7.5 A
AFFX-ThrX-M_at | 38.1 A | 30.6 A | 37.6 A | 7.2 A | 26.9 A | 36.3 A
AFFX-ThrX-3_at | 15.2 A | 5 A | 15 A | 8.3 A | 36.8 A | 11.5 A
AFFX-TrpnX-5_at | 11.2 A | 11.8 A | 22.2 A | 22.1 A | 8.9 A | 35.6 A
AFFX-TrpnX-M_at | 9 A | 8.1 A | 9.1 A | 8.7 A | 8.1 A | 12 A
AFFX-TrpnX-3_at | 19.8 A | 12.8 A | 11.8 A | 43.2 M | 17.4 A | 10 A
AFFX-HUMISGF3A/M97935_5_at | 82.7 P | 120.7 P | 92.7 P | 46.4 P | 55.9 P | 46.5 P
AFFX-HUMISGF3A/M97935_MA_at | 397.6 P | 416.7 P | 244.8 A | 181.4 A | 197.5 A | 192.3 A
AFFX-HUMISGF3A/M97935_MB_at | 206.2 P | 303 P | 300.8 P | 253.5 P | 195.3 P | 216 P
AFFX-HUMISGF3A/M97935_3_at | 663.8 P | 723.9 P | 812.1 P | 666.1 P | 629.4 P | 754.1 P
AFFX-HUMRGE/M10098_5_at | 547.6 P | 405.9 P | 6894.7 P | 3496.1 P | 1958.5 P | 5799.4 P
AFFX-HUMRGE/M10098_M_at | 239.1 P | 175.8 P | 3675 P | 1348.6 P | 695.9 P | 2428.2 P
AFFX-HUMRGE/M10098_3_at | 1236.4 P | 721.4 P | 9076.1 P | 7795.9 P | 4237.1 P | 7890 P
AFFX-HUMGAPDH/M33197_5_at | 19508 P | 19267.1 P | 22892 P | 26584 P | 29666.6 P | 25038.1 P
AFFX-HUMGAPDH/M33197_M_at | 18996.6 P | 20610.4 P | 21573.7 P | 29936 P | 30106.6 P | 22380.2 P
AFFX-HUMGAPDH/M33197_3_at | 18016.4 P | 17463.8 P | 20921.3 P | 26908.3 P | 28382.2 P | 21885 P
AFFX-HSAC07/X00351_5_at | 23294.6 P | 21783.7 P | 18423.3 P | 21858.9 P | 23517.1 P | 19450.3 P
AFFX-HSAC07/X00351_M_at | 25373.1 P | 24922.8 P | 22384.2 P | 25760.2 P | 27718.5 P | 21401.6 P
AFFX-HSAC07/X00351_3_at | 20032.8 P | 20251.1 P | 20961.7 P | 23494.6 P | 23381.2 P | 21173.3 P
Plot of raw data (PM probes)
Plot of log2 data (PM probes)
MA plot: log of fold change (M) vs log of Intensity (A)
M = log2 (A/B)
A = ½ log2 (A*B) = ½ (log2 (A) + log2 (B))
[Figure panels: Hypox1 vs. Hypox2, Hypox3, Norm1, Norm2, Norm3]
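The M and A definitions above can be tried on a couple of probes. This is a minimal sketch: the function name is hypothetical, and the intensities are two rows from the first normal and first tumor columns of the pivot table earlier in the deck. Note that the slide uses A both for one sample's intensity and for the average; the code calls the two samples `a` and `b` and the average `avg`.

```python
import math

def ma_values(a, b):
    """Return (M, A) for the paired intensities a and b of one probe."""
    m = math.log2(a / b)                       # M: log2 fold change
    avg = 0.5 * (math.log2(a) + math.log2(b))  # A: mean log2 intensity
    return m, avg

# (probe, normal, tumor) values copied from the pivot table
for probe, normal, tumor in [("AFFX-BioB-5_at", 210.6, 234.6),
                             ("AFFX-CreX-3_at", 11917.8, 9376.7)]:
    m, avg = ma_values(tumor, normal)
    print(f"{probe}: M={m:+.2f} A={avg:.2f}")
```

Plotting M against A for every probe gives the MA plot; points far from M = 0 are the candidates for changed expression.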
Goals of a Microarray
Experiment
1. Find the genes that change expression
between experimental and control
samples
2. Classify samples based on a gene
expression profile
3. Find patterns: Groups of biologically
related genes that change expression
together across samples/treatments
Basic Data Analysis
• Fold change (relative increase or decrease in
intensity for each gene)
• Set cutoff filter for low values
(background + noise)
• Cluster genes by similar changes - only really
meaningful across multiple treatments or
time points
• Cluster samples by similar gene expression
profiles
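A minimal sketch of the first two bullets: compute a log2 fold change per gene, but only for genes that rise above a low-value cutoff. The gene names, intensities, and the cutoff of 50 are made up for illustration and are not recommended settings.

```python
import math

def filtered_log_fold_changes(control, treated, min_intensity=50.0):
    """Return {gene: log2(treated/control)} for genes above the cutoff.

    A gene is kept only if at least one of its two values reaches
    `min_intensity` (an assumed noise floor).
    """
    result = {}
    for gene, ctrl in control.items():
        trt = treated[gene]
        if max(ctrl, trt) < min_intensity:
            continue  # both values sit in the background/noise range
        result[gene] = math.log2(trt / ctrl)
    return result

control = {"geneA": 400.0, "geneB": 12.0, "geneC": 1000.0}
treated = {"geneA": 1600.0, "geneB": 30.0, "geneC": 500.0}
print(filtered_log_fold_changes(control, treated))
# {'geneA': 2.0, 'geneC': -1.0}
```

geneB shows a 2.5-fold change, but both of its values are below the cutoff, so it is dropped rather than reported.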
Streamlined Affy Analysis
[Flow diagram: Raw data (the Affymetrix pivot table shown earlier) → Normalize (RMA) → Filter → Significance / Classification / Clustering → Gene lists → Function (Gene Ontology)]
Filter
• Present/Absent
• Minimum value
• Fold change
Significance
• t-test
• SAM
• Rank Product
Classification
• PAM
• Machine learning
Scatter plot of all genes in a simple comparison of two
controls (A) and two treatments (B: high vs. low glucose),
showing changes in expression greater than 2.2-fold
and 3-fold.
Thomas Hudson, Montreal Genome Center
Normalization
• Can control for many of the experimental
sources of variability (systematic, not random
or gene specific)
• Bring each image to the same average
brightness
• Can use simple math or fancy
– divide by the mean (whole chip or by sectors)
– LOESS (locally weighted regression)
• No sure biological standards
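The "divide by the mean" option can be sketched in a few lines: every array is rescaled so that all arrays end up with the same average brightness. The sample names and values are invented, and using the average of the per-array means as the target is an assumption; any common target would do.

```python
def scale_to_common_mean(arrays):
    """Rescale each array so that all arrays share the same mean intensity.

    `arrays` maps a sample name to its list of intensities; the target is
    the average of the per-array means.
    """
    means = {name: sum(vals) / len(vals) for name, vals in arrays.items()}
    target = sum(means.values()) / len(means)
    return {name: [v * target / means[name] for v in vals]
            for name, vals in arrays.items()}

raw = {"chip1": [100.0, 200.0, 300.0], "chip2": [200.0, 400.0, 600.0]}
scaled = scale_to_common_mean(raw)
print(scaled["chip1"])  # [150.0, 300.0, 450.0]
print(scaled["chip2"])  # [150.0, 300.0, 450.0]
```

chip2 was uniformly twice as bright as chip1; after scaling, the two arrays are identical, which is what a purely systematic brightness difference should give.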
RMA
• Robust Multichip Average
• Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003), A Comparison of
Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and
Variance. Bioinformatics 19(2):185-193
$$\log(\mathrm{medpol}(PM_{ij} - BG)) = \mu_i + \alpha_j + e_{ij}$$
for array i, probe j
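The "medpol" in the model refers to Tukey's median polish, which splits a table of log intensities into an array effect, a probe effect, and residuals. The sketch below is a plain-Python version for one probe set, with arrays as rows and probes as columns. It is an illustration under assumptions (fixed iteration count, made-up input that is already background-corrected and log-transformed), not the Bioconductor implementation.

```python
from statistics import median

def median_polish(rows, n_iter=10):
    """Fit value[i][j] ~ overall + row[i] + col[j] by Tukey's median polish."""
    resid = [list(r) for r in rows]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # sweep the row medians out of the residuals
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # sweep the column medians out of the residuals
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
    return overall, row_eff, col_eff, resid

# Two arrays (rows) x three probes (columns), hypothetical log2(PM - BG) values
log_pm = [[7.0, 8.0, 9.0],
          [9.0, 10.0, 11.0]]
overall, array_eff, probe_eff, resid = median_polish(log_pm)
print([overall + a for a in array_eff])  # per-array expression estimates
```

The per-array value `overall + array_eff[i]` plays the role of µ i in the model, and `probe_eff[j]` the role of α j. Medians are used instead of means so that a few outlier probes do not drag the estimate.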
Are the Treatments Different?
• Analysis of microarray data has tended to focus on
making lists of genes that are up or down regulated
between treatments
• Before making these lists, ask the question:
"Are the treatments different?"
• PCA/MDS or cluster the samples
• If the treatment is responsible for differences, then
use statistical methods to find the genes most
responsible
• If there are not significant overall differences, then
lists of genes with large fold changes may only reflect
random variability.
Statistics
• When you have variability in
measurements, you need replication and
statistics to find real differences
• It’s not just the genes with 2 fold increase,
but those with a significant p-value across
replicates
• Non-parametric (i.e. rank or permutation) or
paired value statistics may be more
appropriate (low number of samples, high
standard deviation)
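A minimal sketch of a permutation test for one gene, assuming a two-group design and using the difference in group means as the statistic. The expression values are hypothetical. All possible relabellings are enumerated, which is only feasible for small sample sizes.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # count relabellings at least as extreme as the real grouping
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            hits += 1
    return hits / total

control = [7.1, 7.3, 7.0, 7.2]   # hypothetical log2 expression values
treated = [8.4, 8.6, 8.3, 8.5]
print(permutation_p_value(control, treated))
```

With four replicates per group there are only 70 ways to split the eight values, so even perfectly separated groups cannot score below 2/70 ≈ 0.029. This is one concrete reason why replication matters.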
Multiple Comparisons
• In a microarray experiment, each gene (each
probe or probe set) is really a separate
experiment
• Yet if you treat each gene as an independent
comparison, you will always find some with
significant differences
– (the tails of a normal distribution)
• Different genes are NOT independent
False Discovery
• Statisticians call false positives a "type 1 error" or a
"False Discovery"
• The number of false discoveries must be smaller than the number of real
differences that you find, which in turn depends on
the size of the differences and the variability of the
measured expression values
• You can’t know the true false discovery rate for your
data, but it can be estimated in a number of different
ways.
• In biology we tend to be comfortable with an
estimated FDR of 5-10%
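The slides do not say which FDR estimator is meant. One widely used choice is the Benjamini-Hochberg procedure, sketched here as an assumption and not as the lecture's own method. The p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.001, 0.008, 0.039, 0.041, 0.60]   # made-up per-gene p-values
for raw, adj in zip(p, benjamini_hochberg(p)):
    print(f"p={raw:<6} adjusted={adj:.3f}")
```

Keeping the genes whose adjusted value is at most 0.05 (or 0.10) corresponds to accepting an estimated FDR of 5% (or 10%) among the genes reported.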
SAM
Significance Analysis of Microarrays
Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays
applied to the ionizing radiation response. PNAS 2001 98: 5116-5121,
(Apr 24).
• R package, Excel plugin
• Free
• Permutation based
• Most published method of microarray data analysis
SAM- procedure overview
[Flow chart: Sample genes expression scale → Define and calculate a statistic, d(i) → Generate permuted samples → Estimate attributes of d(i)’s distribution → Choose Δ → Identify potentially significant genes → Estimate FDR]
• Calculate “relative difference” – a value
that incorporates the change in expression
between conditions and the variation of
measurements in each condition
• Calculate “expected relative difference” –
derived from controls generated by
permutations of data
• Plot against each other, set cutoff to
identify deviating genes
• Calculate FDR for chosen cutoff from the
control permutations
Relative Difference
$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$$
where
– $\bar{x}_I(i)$, $\bar{x}_U(i)$: mean expression of gene i in conditions I and U
– $s(i)$: gene-specific scatter
– $s_0$: constant to reduce variation of low expressed genes
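A minimal sketch of the relative difference for one gene. It assumes that the gene-specific scatter s(i) is the pooled standard error of the two groups, as in the Tusher et al. paper; the value s0 = 0.1 and the expression values are arbitrary illustrations.

```python
import math
from statistics import mean

def relative_difference(induced, uninduced, s0=0.1):
    """SAM-style d(i) = (mean_I - mean_U) / (s(i) + s0) for one gene.

    s(i) is taken to be the pooled standard error of the two groups;
    s0 = 0.1 is an arbitrary illustrative constant.
    """
    n1, n2 = len(induced), len(uninduced)
    m1, m2 = mean(induced), mean(uninduced)
    ss = sum((x - m1) ** 2 for x in induced) + sum((x - m2) ** 2 for x in uninduced)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

# A low-expressed gene with a tiny, very consistent difference (made-up values)
low_i, low_u = [1.00, 1.01, 1.02], [0.97, 0.98, 0.99]
print(relative_difference(low_i, low_u, s0=0.0))  # s0 = 0: ordinary t-statistic
print(relative_difference(low_i, low_u, s0=0.1))  # with s0: score is damped
```

With s0 = 0 the statistic is the ordinary two-sample t. Adding s0 keeps genes with very small scatter from receiving large scores on the basis of negligible differences.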
SAM Two-Class Unpaired
Permutation tests
i) For each gene, compute the d-value (similar to a t-statistic). This is the
observed d-value, d(i), for that gene.
ii) Randomly shuffle the expression values between groups A and B. Compute
the d-value for each randomized set.
iii) Take the average of the randomized d-values for each gene. This is the
‘expected relative difference’, dE(i), of that gene. The difference between d(i) and
dE(i) is used to measure significance: a gene is called significant when
$|d(i) - d_E(i)| > \Delta$
iv) Plot d(i) vs. dE(i)
v) Calculate FDR = (average number of genes that exceed Δ in the permuted
data) / (number of genes that exceed Δ in the observed data).
[Figure: original vs. randomized assignment of Exp 1 to Exp 6 to Group A and Group B for Gene 1]
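The steps above can be sketched for a toy data set with three experiments per group. Several choices here are assumptions: all 20 relabellings are enumerated instead of sampling random shuffles, the expected values are averaged rank by rank (as in the Tusher et al. paper), a gene is called whenever its observed value is more than Δ away from the expected value at its rank, and s0 = 0.1 is arbitrary. The data are invented.

```python
import math
from itertools import combinations
from statistics import mean

def d_stat(a, b, s0):
    """Relative difference, positive when group B is higher than group A."""
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    ss = sum((x - m1) ** 2 for x in a) + sum((x - m2) ** 2 for x in b)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m2 - m1) / (s + s0)

def sam(data, group_a, delta, s0=0.1):
    """Return (called genes, estimated FDR) for a two-class unpaired design.

    data: gene -> list of expression values, one per experiment
    group_a: column indices of group A (all other columns are group B)
    """
    n = len(next(iter(data.values())))

    def ranked_d(a_idx):
        a_set = set(a_idx)
        return sorted(
            (d_stat([v[i] for i in range(n) if i in a_set],
                    [v[i] for i in range(n) if i not in a_set], s0), gene)
            for gene, v in data.items())

    observed = ranked_d(group_a)
    # enumerate every relabelling instead of sampling random shuffles
    perms = [ranked_d(idx) for idx in combinations(range(n), len(group_a))]
    expected = [mean(d for d, _ in col) for col in zip(*perms)]

    called = [g for (d, g), e in zip(observed, expected) if abs(d - e) > delta]
    false_calls = mean(
        sum(abs(d - e) > delta for (d, _), e in zip(perm, expected))
        for perm in perms)
    fdr = false_calls / len(called) if called else 0.0
    return called, fdr

data = {
    "gene1": [5.0, 5.2, 4.9, 8.1, 7.9, 8.3],   # clearly higher in group B
    "gene2": [6.1, 5.9, 6.0, 6.0, 6.2, 5.8],
    "gene3": [7.3, 7.0, 7.2, 7.1, 7.4, 6.9],
    "gene4": [4.0, 4.3, 4.1, 4.2, 3.9, 4.4],
}
called, fdr = sam(data, group_a=[0, 1, 2], delta=3.0)
print(called, round(fdr, 2))
```

With only 20 possible relabellings the FDR estimate is coarse; real analyses use many more samples and genes.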
“Observed d = expected d” line
SAM Two-Class Unpaired
Significant positive genes
(i.e., mean expression of group B >
mean expression of group A)
• Plot d(i) vs. dE(i)
• For most of the genes: $d(i) \approx d_E(i)$
Significant negative genes
(i.e., mean expression of group A > mean
expression of group B)
The more a gene deviates
from the “observed =
expected” line, the more
likely it is to be
significant. Any gene
beyond the first gene in
the +ve or –ve direction
on the x-axis (including
the first gene), whose
observed exceeds the
expected by at least delta,
is considered significant.
Higher Level
Microarray data analysis
• Clustering and pattern detection
• Data mining and visualization
• Controls and normalization of results
• Statistical validation
• Linkage between gene expression data and gene sequence/function/metabolic pathways databases
• Discovery of common sequences in co-regulated
genes
• Meta-studies using data from multiple
experiments
Types of Clustering
• Hierarchical
– Link similar genes, build up to a tree of all
• Self Organizing Maps (SOM)
– Split all genes into similar sub-groups
– Finds its own groups (machine learning)
• Principal Component
– every gene is a dimension (vector), find a single
dimension that best represents the differences in
the data
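A minimal sketch of the hierarchical option: start with every gene in its own cluster and repeatedly merge the two closest clusters. The distance (1 minus Pearson correlation), the average-linkage rule, and the four profiles are assumptions chosen for illustration. Profiles must not be constant, or the correlation is undefined.

```python
from statistics import mean

def pearson_distance(x, y):
    """1 - Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1 - cov / (var_x * var_y) ** 0.5

def hierarchical_clusters(profiles, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters groups."""
    clusters = [[gene] for gene in profiles]

    def linkage(c1, c2):
        return mean(pearson_distance(profiles[a], profiles[b])
                    for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters

# Hypothetical time-course profiles: two rising genes, two falling genes
profiles = {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [2.0, 4.0, 6.0, 8.5],
            "g3": [4.0, 3.0, 2.0, 1.0], "g4": [8.0, 6.5, 4.0, 2.0]}
print(hierarchical_clusters(profiles, n_clusters=2))
```

Recording the order of the merges, instead of stopping at a fixed number of clusters, gives the tree shown in the usual heat-map dendrograms.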
Cluster by
fold change
GeneSpring
SOM Clusters
Classification
• How to sort samples into two classes based on gene expression data
– Cancer vs. normal
– Cancer sub-types (benign vs. malignant)
– Responds well to drug vs. poor response (e.g. tamoxifen for breast cancer)
PAM: Prediction Analysis for Microarrays
Class Prediction and Survival Analysis for Genomic Expression Data Mining
Performs sample classification from gene expression data,
via the "nearest shrunken centroid method" of Tibshirani, Hastie, Narasimhan and Chu (2002):
"Diagnosis of multiple cancer types by shrunken centroids of gene expression"
PNAS 2002 99:6567-6572 (May 14).
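The idea behind shrunken centroids can be sketched as follows. This is a simplification and not the PAM package: each class centroid is pulled toward the overall centroid by a fixed amount, and a new sample goes to the nearest centroid. The per-gene standardization used in the published method is left out, and the data and threshold are invented.

```python
from statistics import mean

def soft_threshold(x, delta):
    """Shrink x toward zero by delta; values within delta of zero become 0."""
    if abs(x) <= delta:
        return 0.0
    return x - delta if x > 0 else x + delta

def shrunken_centroids(samples, labels, delta):
    """Per-class centroids whose offsets from the overall centroid are shrunk."""
    n_genes = len(samples[0])
    overall = [mean(s[g] for s in samples) for g in range(n_genes)]
    centroids = {}
    for cls in set(labels):
        members = [s for s, lab in zip(samples, labels) if lab == cls]
        centroids[cls] = [
            overall[g] + soft_threshold(mean(m[g] for m in members) - overall[g], delta)
            for g in range(n_genes)]
    return centroids

def classify(sample, centroids):
    """Assign the class whose centroid is nearest in squared distance."""
    return min(centroids,
               key=lambda c: sum((x - y) ** 2 for x, y in zip(sample, centroids[c])))

# Four training samples x three genes; only the first gene separates the classes
samples = [[2.0, 5.0, 7.1], [2.2, 5.1, 6.9], [6.0, 5.0, 7.0], [5.8, 4.9, 7.0]]
labels = ["normal", "normal", "tumor", "tumor"]
centroids = shrunken_centroids(samples, labels, delta=0.5)
print(centroids)
print(classify([5.5, 5.2, 6.8], centroids))
```

After shrinking, the second and third genes have the same centroid value in both classes, so they no longer influence the classification. This is how the method selects a small set of informative genes.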
BioConductor
• All of these normalization, statistical, and
clustering methods are available in a free
software package called BioConductor,
which runs within the R statistical environment
www.bioconductor.org
• command line interface
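> library(affy)  # Bioconductor package providing pm(), mm() and the SpikeIn example data
> data(SpikeIn)
> pms <- pm(SpikeIn)  # perfect-match probe intensities
> mms <- mm(SpikeIn)  # mismatch probe intensities
> par(mfrow = c(1, 2))  # two plots side by side
> # the spike-in concentrations are stored in the sample names
> concentrations <- matrix(as.numeric(sampleNames(SpikeIn)), 20,
+ 12, byrow = TRUE)
> # PM intensity vs. concentration for each probe, log-log axes
> matplot(concentrations, pms, log = "xy", main = "PM", ylim = c(30,
+ 20000))
> lines(concentrations[1, ], apply(pms, 2, mean), lwd = 3)  # mean over probes
> # same plot for the MM probes
> matplot(concentrations, mms, log = "xy", main = "MM", ylim = c(30,
+ 20000))
> lines(concentrations[1, ], apply(mms, 2, mean), lwd = 3)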
Functional Genomics
• Take a list of "interesting" genes and
find their biological relationships
– Gene lists may come from
significance/classification analysis of
microarrays, proteomics, or other high-throughput methods
• Requires a reference set of "biological
knowledge"
Gene Ontology
• How to organize biological
knowledge?
– Biologists work on a variety of
different research organisms:
yeast, fruit fly, mouse, … human
– the same gene can have very
different functions (antennapedia)
and very different names
(sonic hedgehog…)
GO
• Biologists got together and developed a
sensible system called Gene Ontology
(GO)
• 3 hierarchical sets of terminology
– Biological Process
– Cellular Component (location within cell)
– Molecular Function
• about 1000 categories of functions
• List (and convert) gene identifiers from many
genomic resources including NCBI, PIR and
Uniprot/SwissProt as well as Illumina and
Affymetrix gene IDs
• Gene IDs matched to GO function
annotations (for human)
• Test for enrichment of GO categories (or
KEGG pathways, disease associations,
etc.) in list.
• Groups significant categories into clusters
DAVID enrichment score: EASE
• DAVID uses a modified Fisher's Exact test to get p-
values for enrichment.
• Basic idea: is the frequency of this category in this list
greater than the frequency of the category in the genome?
A Hypothetical Example:
In the human genome background (30,000 genes total), 40 genes are involved in
the p53 signaling pathway. A given gene list has found that 3 out of 300 belong to
the p53 signaling pathway. Then we ask whether 3/300 is more than
random chance compared to the human background of 40/30,000.
Fisher Exact P-Value = 0.008.
However, the EASE Score is more conservative. EASE Score = 0.06 (using 3-1
instead of 3). Since the EASE Score is > 0.01, this user gene list is associated with
(enriched in) the p53 signaling pathway no more than expected by random chance.
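The example can be checked with a short calculation. This sketch treats the one-sided Fisher exact test as a hypergeometric tail probability and approximates the EASE score by removing one hit from the count; DAVID's own contingency table is set up slightly differently, so the function names and this formulation are assumptions. The background size is also an assumption: 30,000 genes reproduces scores close to the quoted ones (about 0.007 and 0.06), whereas a 20,000-gene background gives roughly 0.02 and 0.12.

```python
from math import comb

def hypergeom_upper_tail(hits, list_size, pathway_size, background):
    """P(X >= hits) when list_size genes are drawn from the background."""
    total = comb(background, list_size)
    upper = min(list_size, pathway_size)
    favourable = sum(
        comb(pathway_size, k) * comb(background - pathway_size, list_size - k)
        for k in range(hits, upper + 1))
    return favourable / total

def ease_score(hits, list_size, pathway_size, background):
    """EASE-style score: the same tail probability with one hit removed."""
    return hypergeom_upper_tail(hits - 1, list_size, pathway_size, background)

for background in (30000, 20000):
    p = hypergeom_upper_tail(3, 300, 40, background)
    e = ease_score(3, 300, 40, background)
    print(f"background={background}: Fisher p={p:.4f}  EASE={e:.3f}")
```

Removing one hit asks whether the category would still look enriched if a single gene had landed in the list by accident, which is why the EASE score is always the larger (more conservative) of the two.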
Microarray Databases
• Large experiments may have hundreds of
individual array hybridizations
• Core lab at an institution or multiple
investigators using one machine - data
archive and validate across experiments
• Data-mining - look for similar patterns of
gene expression across different
experiments
Public Databases
• Gene Expression data is an essential aspect
of annotating the genome
• Publication and data exchange for
microarray experiments
• Data mining/Meta-studies
• Common data format - XML
• MIAME (Minimal Information About a
Microarray Experiment)
Array Express at EMBL
GEO at the NCBI
Summary
• Microarray experimental details
• Microarray data formats
• QC analysis and data exploration
• Normalization
• Differential expression
• Functional enrichment
• Databases
Next Lecture: Next Generation Sequencing Informatics