Analysis of Array CGH Data
by Hanni Willenbrock
Some slides adapted from J. Fridlyand
BioSys course: DNA Microarray Analysis – Lecture, 2007
Outline
• Introduction to comparative genomic hybridization (CGH) and
array CGH
• Data analysis approaches
- Breakpoint detection
- Loss and gain analysis
- Application of segmentation to testing
• Real data example 1: Application to a primary tumor dataset
• Real data example 2: comparative genomic profiling of
bacterial strains
1
Comparative Genomic Hybridization
• Study types:
- Gain or loss of genetic material
- To find variations in the genetic material
• Purposes:
- Study of chromosomal aberrations often found in cancer and
developmental abnormalities.
- Study of variations in the baseline sequence in a microbial
population (microbial comparative genomics).
PhD defense, October 27th 2006
2
A Variety of Genetic Alterations Underlie
Developmental Abnormalities and Disease
• Inappropriate gene activation or inactivation can be caused
by:
- Mutation
- Epigenetic gene silencing (e.g. addition of methyl groups)
- Reciprocal translocation (exchange of fragments between two
non-homologous chromosomes)
- Gain or loss of genetic material
Any of the above may lead to an oncogene activation or
to inactivation of a tumor suppressor.
3
Existing techniques for detecting structural
abnormalities
Albertson and Pinkel, Human Molecular Genetics, 2003
4
Some microarray platforms for copy number
analysis
• BAC arrays
• Affymetrix SNP chip (500 K)
• Representational oligonucleotide microarray analysis (ROMA)
• Whole genome tiling arrays
• Own design (NimbleGen/NimbleExpress)
5
Array CGH: BAC arrays
12 mm
HumArray3.1
2464 human BAC clones spotted in triplicates
164-196 kbp
6
Array CGH Maps DNA Copy Number Alterations to
Positions in the Genome
Figure labels: test genomic DNA, reference genomic DNA, Cot-1 DNA;
gain of DNA copies in tumor, loss of DNA copies in tumor;
x-axis: position on sequence.
7
Example:
Detection of DiGeorge region
(A) Detection of deletion in the DiGeorge region by FISH.
A chromosome 22 subtelomere probe (green) and the TUPLE1 probe
for the DiGeorge region (red) were hybridized to metaphase
chromosomes from a normal individual and an individual with the
deletion. The arrow indicates the missing red FISH signal on the deleted
chromosome.
(B) Array CGH copy number profile of chromosome 22 showing deletion
in the DiGeorge region (arrow).
Albertson and Pinkel, Human Molecular Genetics, 2003
8
Structural abnormalities
*HSR: homogeneously staining region
Albertson and Pinkel, Human Molecular Genetics, 2003
9
Tumor Genomes are Stable
Copy Number Profiles of a Tumor & Recurrence
10
Analysis of array CGH
Goal: To partition the clones into sets with the same copy number
and to characterize the genomic segments in terms of copy number.
Biological model: genomic rearrangements lead to gains or losses of
sizable contiguous parts of the genome, possibly spanning entire
chromosomes, or, alternatively, to focal high-level amplifications.
11
Varying genomic complexity
Breakpoints
12
Exercise Part I:
Plot and view array CGH data
DNA Microarray Analysis Course, 2007
13
Observed clone value and spatial coherence
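Figure: clone values from two example distributions, N(-0.3, 0.08^2) and
N(0.6, 0.1^2); clones of uncertain membership are marked "?".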
Useful to make use of the physical dependence of the nearby clones,
which translates into copy number dependence.
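A minimal sketch of why neighboring clones help, assuming only two candidate states with the example distributions from the slide, N(-0.3, 0.08^2) and N(0.6, 0.1^2). The function names and the labels "loss" and "gain" are illustrative assumptions, not part of the lecture material.

```python
import math

# Two candidate states, taken from the example distributions on the slide.
# The labels "loss" and "gain" are illustrative assumptions.
STATES = {"loss": (-0.3, 0.08), "gain": (0.6, 0.1)}


def log_likelihood(values, mean, sd):
    """Gaussian log-likelihood of independent values (constant term dropped)."""
    return sum(-0.5 * ((v - mean) / sd) ** 2 - math.log(sd) for v in values)


def most_likely_state(values):
    """Pick the state that best explains all values taken together."""
    return max(STATES, key=lambda s: log_likelihood(values, *STATES[s]))


# A clone at 0.1 lies far from both states when judged on its own.
single_call = most_likely_state([0.1])
# Judged together with its two neighbors, the evidence is much stronger.
neighbor_call = most_likely_state([0.55, 0.1, 0.65])
print(single_call, neighbor_call)
```

Taken alone, the clone at 0.1 is about five standard deviations from either state, so its call rests on a very small likelihood difference. Pooling it with its neighbors gives a clear decision, which is the intuition behind segmentation.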
DNA Microarray Analysis Course, 2006
14
Expected log2 ratio as a function of
copy number change, normal cell contamination and ploidy
Figure panels: reference ploidy = 3 and reference ploidy = 2.
Percentage labels in the figure: 100%, 50%, 10%.
Values shown in the figure: 2.58, 2.0, 0.58, 0.42, 0.58, 0.07, 0.0, 0.38
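The dependence on copy number, contamination and ploidy can be sketched with a simple mixture model. This is an assumption for illustration and may not be the exact model behind the figure: normal cells are taken to carry 2 copies, and the mixed copy number is divided by the reference ploidy. The function and parameter names are hypothetical.

```python
import math


def expected_log2_ratio(copy_number, tumor_fraction=1.0, reference_ploidy=2):
    """Expected log2 ratio under an assumed tumor/normal mixture model.

    copy_number      -- copies of the region in tumor cells
    tumor_fraction   -- proportion of tumor cells in the sample (0 to 1)
    reference_ploidy -- ploidy the ratio is normalized against
    """
    # Average copy number in the sample: tumor cells plus diploid normal cells.
    mixed = tumor_fraction * copy_number + (1.0 - tumor_fraction) * 2
    return math.log2(mixed / reference_ploidy)


# Single-copy gain in a pure sample: log2(3/2), about 0.58.
print(round(expected_log2_ratio(3), 2))
# The same gain with only 10% tumor cells is barely visible.
print(round(expected_log2_ratio(3, tumor_fraction=0.1), 2))
```

Under this model the signal shrinks quickly as the tumor fraction drops, which is why contamination by normal cells makes low-level gains and losses hard to detect.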
15
Simulation Study
• Many algorithms to choose from
• Mainly evaluated only on limited examples
• Few comparisons between algorithm performance
• Choice of evaluation criteria:
- False breakpoint detection vs. missed breakpoints
- Sample type preferences (size of segments, noise, etc)
16
Methods for Segmentation
• HMM: Hidden Markov Model (aCGH package)
Fits HMMs in which any state is reachable from any other state
(Fridlyand et al, JMVA, 2004).
• CBS: Circular binary segmentation (DNAcopy package)
Recursively splits the chromosomes into contiguous regions of equal
copy number and assesses the significance of the proposed splits
using a permutation reference distribution (Olshen et al, Biostatistics,
2004).
• GLAD: Gain and Loss Analysis of DNA (GLAD package)
Detects chromosomal breakpoints by estimating a piecewise constant
function that is based on adaptive weights smoothing (Hupe et al,
Bioinformatics, 2004).
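To make the idea of splitting with a permutation reference distribution concrete, here is a minimal sketch of plain binary segmentation. It is a deliberate simplification and not the circular binary segmentation implemented in DNAcopy. The function names, the test statistic and the defaults (200 permutations, alpha 0.01, minimum segment length 4) are assumptions chosen for illustration.

```python
import math
import random


def best_split(values, min_len):
    """Return (index, statistic) of the best single split of values."""
    n = len(values)
    total = sum(values)
    best_idx, best_stat = None, 0.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        if i < min_len or n - i < min_len:
            continue  # both sides must hold at least min_len clones
        diff = left_sum / i - (total - left_sum) / (n - i)
        # Scale the mean difference so short segments do not dominate.
        stat = abs(diff) * math.sqrt(i * (n - i) / n)
        if stat > best_stat:
            best_idx, best_stat = i, stat
    return best_idx, best_stat


def segment(values, offset=0, n_perm=200, alpha=0.01, min_len=4, rng=None):
    """Recursively split values; return breakpoint indices in order."""
    rng = rng or random.Random(0)
    idx, stat = best_split(values, min_len)
    if idx is None:
        return []
    # Permutation reference distribution: shuffle clones, repeat the search.
    shuffled = list(values)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_split(shuffled, min_len)[1] >= stat:
            exceed += 1
    if (exceed + 1) / (n_perm + 1) > alpha:
        return []  # split not significant: keep as one segment
    left = segment(values[:idx], offset, n_perm, alpha, min_len, rng)
    right = segment(values[idx:], offset + idx, n_perm, alpha, min_len, rng)
    return left + [offset + idx] + right


# Toy profile (hypothetical data): 30 clones near 0, then 30 clones near 0.6.
profile = [0.0, 0.05, -0.05] * 10 + [0.6, 0.65, 0.55] * 10
breakpoints = segment(profile)
print(breakpoints)
```

Each reported breakpoint is the index of the first clone of the segment to its right. On this toy profile the single change at index 30 should be recovered, while the two flat halves should not be split further.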
17
Comparison Scheme
• Use of simulated data, where the truth is known
• The noise is controlled (see later slide)
Figure labels: one segment; true breakpoint; false predicted breakpoint
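With simulated data, predicted breakpoints can be scored against the known truth. The sketch below counts correct, false and missed breakpoints. The matching rule (a prediction counts as correct if it lies within `tolerance` clones of a true breakpoint not already matched) is an assumption for illustration, as are the function name and the default tolerance.

```python
def score_breakpoints(true_bps, predicted_bps, tolerance=2):
    """Count true positives, false predictions and missed breakpoints."""
    unmatched_true = sorted(true_bps)
    true_pos = 0
    false_pos = 0
    for pred in sorted(predicted_bps):
        # True breakpoints close enough to this prediction and still unmatched.
        candidates = [t for t in unmatched_true if abs(t - pred) <= tolerance]
        if candidates:
            closest = min(candidates, key=lambda t: abs(t - pred))
            unmatched_true.remove(closest)  # each true breakpoint matches once
            true_pos += 1
        else:
            false_pos += 1
    return {
        "true_positive": true_pos,
        "false_positive": false_pos,
        "missed": len(unmatched_true),
    }


# Hypothetical example: truth at clones 30 and 60, predictions at 31, 45, 90.
result = score_breakpoints([30, 60], [31, 45, 90])
print(result)
```

Counting false and missed breakpoints separately matters because the evaluation criteria trade one against the other.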
18
Breakpoint Detection Accuracy
19
Exercise Part II:
Segmentation and breakpoint prediction
20
Merging segments
Note that all procedures operate on individual
chromosomes, which results in a large number of
segments with mean values close to each other.
Additional challenge: reduce the number of segments by
merging those that are likely to correspond to the same
copy number.
This will facilitate inference of altered regions.
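A minimal sketch of the merging idea, assuming a simple rule: segments whose means lie within a fixed distance of each other are treated as one copy number level. This threshold rule is an illustrative assumption and is not the merging procedure used in the lecture. The function name and the default threshold of 0.1 are hypothetical.

```python
def merge_levels(segment_means, segment_lengths, threshold=0.1):
    """Assign each segment the length-weighted mean of its merged level."""
    # Visit segments in order of increasing mean.
    order = sorted(range(len(segment_means)), key=lambda i: segment_means[i])
    groups = []
    for i in order:
        # Join the current group if the gap to its largest mean is small.
        if groups and segment_means[i] - segment_means[groups[-1][-1]] <= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    merged = [0.0] * len(segment_means)
    for group in groups:
        weight = sum(segment_lengths[i] for i in group)
        level = sum(segment_means[i] * segment_lengths[i] for i in group) / weight
        for i in group:
            merged[i] = level
    return merged


# Hypothetical segments from several chromosomes: means and clone counts.
means = [0.02, -0.01, 0.55, 0.60, 0.03]
lengths = [40, 25, 30, 10, 50]
merged = merge_levels(means, lengths)
print(merged)
```

Here five segments collapse into two levels, one near 0 and one near 0.56, which is easier to interpret as "unchanged" and "gained" than five slightly different means.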
21
Merging
• For estimating actual copy number levels from segmentations
22
Segmentation and Merging
23
ROC Curve:
Identification of copy number alterations for varying thresholds
24
Exercise Part III:
Estimate copy number gain and losses
25
Using segmentation for testing
(phenotype association studies)
Example case:
Find clones (or whole segments) that differ significantly in copy number
between two cancer subtypes.
Task:
Investigate whether incorporating spatial information (segmentation) into testing
for differential copy number increases detection power.
Data type:
Samples with either of 2 different phenotypes (e.g. 2 different cancer subtypes)
How:
Comparison of sensitivity and specificity using:
1. Original test statistic (no use of spatial information)
2. Segmented T-statistic derived from original log2 ratios
3. T-statistic computed from segmented log2 ratios
26
Simulation of Array CGH Data
Real biological variation considered:
• Breast cancer data used as model data
Segment length and copy number are taken from the empirical distribution observed in
breast cancer data (DNAcopy segmentation).
• Mixture of cells (sample is not pure)
Each sample was assigned a value, Pt: proportion of tumor cells, between 0.3 and
0.7 from a uniform distribution.
• Experimental noise is Gaussian
Standard deviations drawn from a uniform distribution between 0.1 and 0.2 to imitate
real data where the noise may vary between experiments.
• Cancer subtypes are heterogeneous
Certain aberrations characteristic of a cancer subtype may only exist in a percentage
of the patients with that cancer subtype. Thus, in each sample, segments with copy
number alterations (copy number not 2) were removed at random with probability
30%.
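The simulation steps above can be sketched as follows. The tumor proportion, noise range and 30% removal probability come from the slide. The segment list is a hypothetical stand-in for the empirical breast cancer distribution, and the mixture formula for the log2 ratio (normal cells carry 2 copies) is an assumption.

```python
import math
import random


def simulate_sample(segments, rng):
    """Simulate one sample.

    segments -- list of (length_in_clones, copy_number) pairs
    Returns (true_log2_ratios, observed_log2_ratios).
    """
    tumor_fraction = rng.uniform(0.3, 0.7)  # Pt: proportion of tumor cells
    noise_sd = rng.uniform(0.1, 0.2)        # experiment-specific noise level
    truth, observed = [], []
    for length, copy_number in segments:
        # Heterogeneity: drop an aberration with probability 30%.
        if copy_number != 2 and rng.random() < 0.3:
            copy_number = 2
        # Assumed mixture of tumor cells and diploid normal cells.
        mixed = tumor_fraction * copy_number + (1 - tumor_fraction) * 2
        level = math.log2(mixed / 2)
        for _ in range(length):
            truth.append(level)
            observed.append(level + rng.gauss(0.0, noise_sd))
    return truth, observed


# Hypothetical segment layout: normal, single-copy gain, single-copy loss.
layout = [(50, 2), (30, 3), (20, 1)]
truth, observed = simulate_sample(layout, random.Random(1))
print(len(observed), round(truth[50], 3), round(truth[80], 3))
```

Because both the tumor proportion and the noise level are drawn per sample, the same underlying aberration appears with a different height and clarity in each simulated sample.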
27
Testing samples (original values)
Figure: 20 samples from either of 2 classes (panel labels: 37.5%, x9 and 57.0%, x11).
Red is the true copy number, black dots are simulated values, and circles mark
examples of heterogeneity.
28
Testing samples (original values)
Red: truly different clones
Testing: why is multiple testing necessary?
Standard p-value cutoff for alpha=0.05 => many false positives
30
Testing: why is multiple testing necessary?
• Significance with random class assignments?
• By chance, many test statistics are below/above standard
significance thresholds
2.93
5.29
-3.99 (maximum deviating value)
31
The maxT Multiple Testing Correction
By repeating random class assignment and
testing, e.g. 100 times, the following "permutation
reference distribution" of the maximum absolute test
statistic is obtained (maxT distribution).
We wish to control the family-wise
error rate (FWER) at alpha=0.05 (5%
chance of at least one false positive).
Therefore, the cut-off should be such
that a false positive occurs in only 5%
of the random cases (95th percentile): cutoff = 5
Figure labels: standard significance threshold; maxT multiple
testing corrected threshold
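A minimal sketch of the maxT procedure described above, assuming a Welch t-statistic per clone and toy data. The function names, the simulated data and the choice of t-statistic are assumptions for illustration. The 100 permutations and alpha=0.05 follow the slide.

```python
import math
import random
import statistics


def t_statistic(a, b):
    """Welch t-statistic between two groups of values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def clone_statistics(data, labels):
    """data: one list of per-sample values per clone; labels: 0/1 per sample."""
    stats = []
    for clone in data:
        a = [v for v, lab in zip(clone, labels) if lab == 0]
        b = [v for v, lab in zip(clone, labels) if lab == 1]
        stats.append(t_statistic(a, b))
    return stats


def maxt_cutoff(data, labels, n_perm=100, alpha=0.05, rng=None):
    """Cutoff on |t| exceeded by the maximum in at most alpha of permutations."""
    rng = rng or random.Random(0)
    shuffled = list(labels)
    max_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random class assignment
        max_stats.append(max(abs(t) for t in clone_statistics(data, shuffled)))
    max_stats.sort()
    n_allowed = int(alpha * n_perm)  # permutations allowed above the cutoff
    return max_stats[n_perm - n_allowed - 1]


rng = random.Random(1)
labels = [0] * 10 + [1] * 10
# 50 clones without any class difference (hypothetical toy data) ...
data = [[rng.gauss(0.0, 0.15) for _ in labels] for _ in range(50)]
# ... plus one clone (index 50) that truly differs between the classes.
data.append([rng.gauss(0.6 * lab, 0.15) for lab in labels])

cutoff = maxt_cutoff(data, labels)
observed = clone_statistics(data, labels)
significant = [i for i, t in enumerate(observed) if abs(t) > cutoff]
print(round(cutoff, 2), significant)
```

Because the cutoff is taken from the distribution of the maximum over all clones, it is stricter than a per-clone threshold, which is what keeps the chance of any false positive near alpha.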
32
Testing samples (original values)
Figure labels: maxT p-value cutoff for alpha = 0.05; standard p-value cutoff
for alpha = 0.05
33
Testing: Segmenting test statistics
Reference
34
Testing segmented samples
1. Segmentation of individual samples...
35
Testing segmented samples
Reference
2. T-statistic from segmented individual samples...
36
Detecting regions with differential copy number
Willenbrock and Fridlyand. Bioinformatics 2005; 21(22): 4084-91
37
Variation of Simulation Parameters
• Signal-to-noise ratio
- CBS consistently has the best performance
- HMM has the highest FDR
- GLAD is the least sensitive
• Alternative empirical distributions of segment lengths
- HMM has the highest sensitivity for segment sizes below 10
- CBS has the highest sensitivity for segment sizes of 10 or larger
- GLAD consistently performs the worst
• Outlier detection
38
Real Data Example 1: Primary Tumor Data
• 75 oral squamous cell carcinomas (SCCs)
• TP53 mutational status of all samples was determined using
sequence information (Snijders et al., 2005)
• Tasks:
- Characterize wild-type and mutant samples with respect
to their genomic alterations
- Build a classifier to predict TP53 mutational status
39
Frequency of Gain/Loss Comparisons
Threshold-based: 5% altered
Merge-based: 33% altered
40
Why such a difference in alteration frequency?
Figure thresholds: +2.5 x MAD and -2.5 x MAD
Willenbrock and Fridlyand. Bioinformatics 2005; 21(22): 4084-91
• High threshold-based cut-off is due to the high
experimental noise of the paraffin-embedded tumors
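A minimal sketch of threshold-based calling with a cutoff of 2.5 x MAD. How the noise values are obtained is an assumption here: they are treated as residuals (observed value minus segment mean), and the MAD is used unscaled. Function names are hypothetical.

```python
import statistics


def mad(values):
    """Median absolute deviation around the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def call_gain_loss(log2_ratios, noise_values, factor=2.5):
    """Label each clone 'gain', 'loss' or 'normal' using +/- factor x MAD."""
    cutoff = factor * mad(noise_values)
    calls = []
    for ratio in log2_ratios:
        if ratio > cutoff:
            calls.append("gain")
        elif ratio < -cutoff:
            calls.append("loss")
        else:
            calls.append("normal")
    return cutoff, calls


# Hypothetical residuals from a noisy sample and three clone values to call.
residuals = [-0.2, -0.1, 0.0, 0.1, 0.2]
cutoff, calls = call_gain_loss([0.3, -0.1, -0.4], residuals)
print(cutoff, calls)
```

With noisy samples the MAD is large, the cutoff moves outward, and fewer clones are called altered, which is consistent with the low alteration frequency seen for the threshold-based approach.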
41
Classification results
Willenbrock and Fridlyand. Bioinformatics 2005; 21(22): 4084-91
42
Real Data Example 2:
Comparative genomic profiling of several
Escherichia coli strains
• The microarray design included probes for:
- 7 known E. coli strains
- 39 known E. coli bacteriophages
- 104 known E. coli virulence genes
• Experimentally:
- 2 sequenced control strains (W3110 and EDL933), 3 replicates
- 2 non-sequenced strains (D1 and 3538), 3 replicates
- Bacteriophage: 3538 (stx2::cat), 2 replicates
43
Comparative Genomic Profiling:
challenges
• Ratio problems: some genes might be present in the query
strain but not in the known reference strain.
• Single channel microarrays or dual channel microarrays?
- In this case, we used an Affymetrix single channel custom made
array (NimbleExpress)
• Partly present genes versus similar but different genes.
44
Homology between the 7 E. coli strains
included on the microarray
• Very high similarity between the two K12 strains and
between the two O157:H7 strains.
• Percentage of homologues for E. coli genomes in
columns found in E. coli genomes in rows.
Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
45
BLAST Atlas
Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
46
Hybridization Atlases
Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
• Probe hybridizations for experiments (samples) result in a pattern similar to that
expected from the BLAST atlas.
47
Mapping the phage Φ3538 (stx2::cat)
Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
48
Zoom of phage Φ3538 (stx2::cat)
Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
• The hybridization pattern is very similar for the phage, strain 3538 and strain
D1.
49
Hierarchical Cluster Analysis
K-12
• D1 is very similar to the K-12 type strains (W3110 + MG1655).
E. coli virulence genes
Willenbrock et al. Journal of Bacteriology. 2006 Nov;188(22):7713-21.
• D1 is probably still a commensal strain (an organism participating in a symbiotic relationship
from which it benefits while the other is unaffected).
51
Summary
• Comparative genomic profiling of two E. coli strains
- O175:H16 D1
- O157:H7 3538
• Identification of virulence genes and phage elements
Conclusions:
• D1 is similar to the K-12 type strains
• Characterization of D1 and 3538 genes:
- Identification of a number of genes involved in DNA transfer and
recombination
52
Advantages over Conventional Expression Arrays
1. Hybridization of DNA to microarray (DNA is much more
stable)
2. Little normalization is necessary
3. Use of spatial coherence in the analysis
4. Only 1 sample is necessary to draw conclusions (biological
replicates are still necessary to draw general conclusions
regarding a certain biological subtype)
5. Results may be easier to interpret and to correlate with
sample phenotypes (e.g. loss of oncogene repressor ->
certain cancer subtype)
53
Summary
• Numerous methods have been introduced for segmentation of
DNA copy number data and breakpoint identification. It is
important to benchmark them against existing methods
(however, only feasible if the software is publicly available)
• Currently, CBS (DNAcopy package) has the best overall
performance
• Use of spatial dependency in the analysis improves testing
power on a clone-by-clone basis
• Merging of segmentation results improves copy number
phenotype characterization
• Study types:
- Study of copy number in cancer samples
- Study of samples from patients with mental diseases
- Comparison of bacterial strains
54
Questions?
Exercise Part IV + Bonus exercise:
Real data analysis
55