Global Variation in Copy Number
in the Human Genome
Nature, Genome Research, Genome Research,
2006.
Speaker: Yao-Ting Huang
References

Redon et al. Global variation in copy number in the human genome. Nature, 2006.
Fiegler et al. Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Research, 2006.
Komura et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Research, 2006.
Komura et al. Noise reduction from genotyping microarrays using probe level information. In Silico Biology, 2006.
Price et al. SW-Array: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. NAR, 2005.
Copy Number Variation (CNV)

A copy number variation (CNV) is a DNA segment of at least 1 kb that is present at a variable copy number compared with a reference genome.

CNVs are thought to arise through non-allelic homologous recombination.

CNVs may disrupt genes, alter gene dosage, and confer risk of complex disease traits such as susceptibility to HIV-1 infection.
Examples of CNVs (1)
[Figure: two families in which both parents carry copy number 2. In one, the offspring carries copy number 1 (deletion); in the other, the offspring carries copy number 3 (duplication).]
Examples of CNVs (2)
It is hard to tell the actual type of a CNV, even within a family.
[Figure: both parents carry copy number 3; one offspring carries copy number 2 and another carries copy number 4. Either outcome could result from Mendelian inheritance, a deletion, or a duplication.]
Use of Two Array Platforms

(1) Whole Genome TilePath (WGTP) array, covering 93.7% of euchromatin; (2) Affymetrix 500K SNP array.
Results

A total of 1,447 CNVs are identified after merging the calls from the two arrays: 913 CNVs from the tiling array and 980 CNVs from the SNP genotyping array.
 These CNVs cover 360 Mb (12%) of the human genome.

The mean CNV size is 341 kb on the tiling array and 206 kb on the SNP array.

The large-insert clones (~170 kb) on the tiling array tend to overestimate CNV size.
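The merging step mentioned above could be sketched as follows. This is a minimal illustration that assumes calls from the two platforms are merged whenever their genomic intervals overlap; the slides do not state the exact rule. The function name `merge_intervals` and the coordinates are hypothetical.

```python
from typing import List, Tuple

def merge_intervals(calls: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping (start, end) CNV calls on one chromosome.

    Assumption: two calls belong to the same merged CNV if they overlap.
    """
    merged: List[List[int]] = []
    for start, end in sorted(calls):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous merged interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]

# Hypothetical calls (base-pair coordinates) from the two platforms.
tiling_calls = [(100_000, 500_000), (1_500_000, 1_800_000)]
snp_calls = [(400_000, 900_000)]
merged_calls = merge_intervals(tiling_calls + snp_calls)
```

Merging explains why the combined count (1,447) is smaller than the sum of the per-platform counts (913 + 980).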
Strengths and Weaknesses of the Two Arrays

The 500K SNP array is better at detecting smaller CNVs.

The tiling array has more power than the SNP array in segmentally duplicated regions.
Location of CNVs

CNVs are preferentially located outside of genes and ultra-conserved elements.

Types of sequences                         WGTP CNVs   500K CNVs
RefSeq (~25,000 genes)                         2,561       1,139
OMIM (1,961 genes)                               251         112
Ultra-conserved elements (481 elements)           48          16
Conserved non-coding elements                116,678      59,397
Other Results


48% of gaps in the human genome assembly are flanked or overlapped by CNVs.
24% of the 1,447 CNVs are associated with segmental duplications.

 Some segmental duplications are themselves CNVs and thus are not fixed in the human genome.
12% of the 1,447 CNVs are validated by a locus-specific quantitative assay (e.g., quantitative PCR).
Linkage Disequilibrium between biallelic CNVs and Tag SNPs

Linkage disequilibrium between bi-allelic CNVs and
flanking SNPs can guide the selection of tag SNPs.


 For example, the copy number of CNV1 can be predicted from SNP2.
 A single SNP array is therefore sufficient to detect both SNPs and CNVs.

#   SNP1   SNP2   CNV1          SNP3   SNP4
1   C      A      Copy # = 2    C      C
2   A      C      Copy # = 3    G      C
3   C      A      Copy # = 2    C      T
4   C      C      Copy # = 3    G      C
5   A      A      Copy # = 2    C      C
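A minimal sketch of how a tag SNP might be chosen from the example on this slide: each SNP is coded 0/1 and its squared correlation (r²) with the copy number of CNV1 is computed. The alleles and copy numbers are those read off the slide's figure; the 0/1 coding and the function names are assumptions made for illustration.

```python
from typing import Dict, List

def r_squared(x: List[float], y: List[float]) -> float:
    """Squared Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def encode(alleles: str) -> List[float]:
    """Code the first allele seen as 0 and any other allele as 1 (assumed coding)."""
    first = alleles[0]
    return [0.0 if a == first else 1.0 for a in alleles]

# One character per row of the slide's figure.
snps: Dict[str, str] = {
    "SNP1": "CACCA",
    "SNP2": "ACACA",
    "SNP3": "CGCGC",
    "SNP4": "CCTCC",
}
cnv1 = [2.0, 3.0, 2.0, 3.0, 2.0]

ld = {name: r_squared(encode(alleles), cnv1) for name, alleles in snps.items()}
best = max(ld.values())
tag_candidates = sorted(name for name, value in ld.items() if abs(value - best) < 1e-9)
```

In this toy data SNP2 and SNP3 track CNV1 perfectly, so either could serve as the tag SNP, while SNP1 and SNP4 carry little information about the copy number.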
Linkage Disequilibrium between biallelic CNVs and Tag SNPs

For example, suppose SNP2 is selected as the tag SNP.
[Figure: the same SNP1-SNP4 and CNV1 table as on the previous slide. A sample typed only at SNP2 is predicted to carry copy number 2 if its allele is A and copy number 3 if its allele is C.]
Results of Linkage Disequilibrium
around bi-allelic CNVs

51% of CNVs in non-African populations have tag SNPs, whereas only 22% of CNVs in the African population can be tagged.
 Duplications generate linkage disequilibrium at the acceptor locus rather than the donor locus.
 The Phase I HapMap project has a paucity of SNPs in segmentally duplicated regions, where CNVs are enriched.
 Given false-positive CNVs and the uncertainty of CNV boundaries, these results are biased (Conrad et al., Nat. Genet., 2006).

Linkage Disequilibrium around multiallelic CNVs

Linkage disequilibrium between a multi-allelic CNV and each flanking SNP is computed as the square of Pearson's correlation coefficient.

 No SNPs in strong linkage disequilibrium are found.
 This is a mistake: a bi-allelic SNP is being compared with a multi-allelic CNV.
[Figure: bi-allelic SNPs (e.g., SNP1 = C or A) flanking CNV1, whose copy number can be 0, 1, 2, or 3.]
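For reference, the statistic named above can be written out as follows. The notation is an assumption, not taken from the paper: x_i is a numeric coding of the SNP genotype of sample i (for example, the count of one allele), c_i is that sample's copy number at the CNV, and bars denote means over the n samples.

```latex
r^2 = \frac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(c_i - \bar{c})\right]^2}
           {\sum_{i=1}^{n} (x_i - \bar{x})^2 \;\sum_{i=1}^{n} (c_i - \bar{c})^2}
```

Under such a coding the SNP takes at most three values per sample while the copy number can take more, so r² captures only a linear trend between the two.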
Lunch Break - Method
Intensity preprocessing
CNV detection
Copy number inference
Intensity Preprocessing

The signal intensity can be skewed by
 the length of the restriction enzyme fragment,
 the GC content of the probe sequence,
 the GC content of the restriction fragment, or
 affinity differences between SNP genotypes (e.g., AA, AC, CC).

Probe selection, noise reduction, and normalization are done at this stage (Komura et al., In Silico Biology, 2006).
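One simple way to picture the correction of such a skew is to fit a straight line between log intensity and a single covariate (here GC content) and keep the residuals. This is only a sketch under that linear assumption; it is not the procedure of Komura et al., and the function name and numbers are hypothetical.

```python
from typing import List

def remove_linear_trend(values: List[float], covariate: List[float]) -> List[float]:
    """Subtract the least-squares line of `values` on `covariate`.

    Returns residuals centred on zero. With a constant covariate only the
    mean is removed.
    """
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    if sxx == 0:
        return [v - mean_v for v in values]
    slope = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values)) / sxx
    return [v - mean_v - slope * (c - mean_c) for v, c in zip(values, covariate)]

# Hypothetical log2 intensities that rise with probe GC content.
log_intensity = [10.0, 10.5, 11.0, 11.5]
gc_content = [0.3, 0.4, 0.5, 0.6]
corrected = remove_linear_trend(log_intensity, gc_content)
```

The same function could be applied to another covariate, such as restriction fragment length, although a joint fit of all covariates would be the more careful choice.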
CNV Detection
For each pair of samples, we can test the relative intensity ratio at each SNP position.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis). Positive ratios indicate a relative gain of copy, ratios near 0 indicate no copy number change, and negative ratios indicate a relative loss of copy.]
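The per-SNP ratio for a pair of samples can be sketched as below, assuming each sample is represented by a list of positive, already normalized intensities in the same SNP order. The sample names and values are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

def log2_ratios(sample_a: List[float], sample_b: List[float]) -> List[float]:
    """log2(a / b) at each SNP position; both lists must be the same length."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same SNP positions")
    return [math.log2(a / b) for a, b in zip(sample_a, sample_b)]

# Hypothetical normalized intensities at three SNP positions.
intensities: Dict[str, List[float]] = {
    "a": [200.0, 100.0, 50.0],
    "b": [100.0, 100.0, 100.0],
    "c": [100.0, 100.0, 50.0],
}

# One ratio profile per unordered pair of samples.
pairwise: Dict[Tuple[str, str], List[float]] = {
    (s, t): log2_ratios(intensities[s], intensities[t])
    for s, t in combinations(sorted(intensities), 2)
}
```

A ratio of 1 means sample a has twice the signal of sample b at that SNP (relative gain), 0 means no difference, and -1 means half the signal (relative loss).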
CNV Detection
A CNV is detected by finding clusters of sufficiently high (or low) ratios.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis).]
CNV Detection

The intensity ratios at all SNPs can be regarded as a sequence of real numbers.
We seek a consecutive subsequence of maximum sum.
[Figure: log2 intensity ratio by SNP position]
0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83
CNV Detection

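A dynamic programming algorithm called SW-Array is used to find the subsequence (Price et al., NAR, 2005).

The underlying maximum-sum recurrence was described by Bentley in 1984:

$$ S(i) = P_i + \max\{ S(i-1),\ 0 \} $$

Here P_1, P_2, P_3, P_4, P_5, … are the successive ratios 0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83, and S(i) is the maximum sum of a consecutive subsequence ending at position i.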
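The recurrence S(i) = P_i + max{S(i-1), 0} can be sketched directly in code. This is the plain maximum-sum scan only, not the full SW-Array procedure; the function name is hypothetical, and the example sequence is a shortened illustration rather than the slide's full (elided) data.

```python
from typing import List, Tuple

def max_sum_segment(p: List[float]) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) of the consecutive run with maximum sum.

    Indices are 0-based and `end` is inclusive.
    """
    if not p:
        raise ValueError("p must not be empty")
    best_sum, best_start, best_end = p[0], 0, 0
    s_prev, start = 0.0, 0  # S(i-1) and the start of the run that ends at i-1
    for i, p_i in enumerate(p):
        if s_prev > 0:
            s_i = p_i + s_prev   # extend the current run
        else:
            s_i = p_i            # start a new run at position i
            start = i
        if s_i > best_sum:
            best_sum, best_start, best_end = s_i, start, i
        s_prev = s_i
    return best_sum, best_start, best_end

# Illustrative log2 ratios: a stretch of gains followed by a stretch of losses.
ratios = [0.0, 0.54, 1.21, 0.26, 2.34, 0.0, 0.1, -1.43, -0.2, -2.4, -2.6, -1.83]

gain = max_sum_segment(ratios)
# Losses are found by running the same scan on the negated ratios.
loss = max_sum_segment([-r for r in ratios])
```

The scan visits each position once, so its running time is linear in the number of SNPs.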
Copy Number Inference

Each such cluster implies a putative CNV.
But we still don't know the exact copy number.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis).]
Pairwise Comparison for All Samples

The above algorithm is repeated for each pair of samples.
[Figure: Sample a / Sample b]
Copy Number Inference

The largest group of samples with the same copy number is called the diploid group.
This diploid group is used as a reference representing two copies.
 The authors assume that mutation events are rare, so two copies should be the most frequent state in the population.
Steps of Copy Number Inference
Copy Number Inference

Samples c, d, and e are the largest group.
Copy Number Inference

The copy numbers of samples a and b are inferred by comparing their intensity ratios with the average ratio of the diploid group.
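The inference steps above can be sketched as follows. The sketch assumes that each sample has one log2 intensity ratio at the CNV (measured against a common baseline), that samples are grouped by splitting the sorted ratios wherever the gap exceeds a hypothetical tolerance `tol`, and that signal is proportional to copy number, so copy number is about 2 · 2^(ratio - diploid mean). None of these details are stated in the slides, and the ratios are made up.

```python
from typing import Dict, List

def infer_copy_numbers(log2_ratio: Dict[str, float], tol: float = 0.3) -> Dict[str, int]:
    """Infer integer copy numbers using the largest group as the 2-copy reference."""
    # Group samples: sort by ratio and start a new group at every gap > tol.
    ordered = sorted(log2_ratio, key=lambda s: log2_ratio[s])
    groups: List[List[str]] = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if log2_ratio[cur] - log2_ratio[prev] > tol:
            groups.append([cur])
        else:
            groups[-1].append(cur)
    # The largest group is taken to be the diploid group (two copies).
    diploid = max(groups, key=len)
    reference = sum(log2_ratio[s] for s in diploid) / len(diploid)
    # Assumed model: intensity scales linearly with copy number.
    return {s: round(2 * 2 ** (r - reference)) for s, r in log2_ratio.items()}

# Hypothetical log2 ratios for the five samples named on the slides.
ratios = {"a": -0.95, "b": 0.56, "c": 0.0, "d": 0.04, "e": -0.03}
copy_numbers = infer_copy_numbers(ratios)
```

With these values, samples c, d, and e form the largest group and are assigned two copies, while a and b come out at one and three copies respectively.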
Concluding Remarks

The authors identify 1,447 CNVs using whole
genome tiling and SNP genotyping arrays.
Given the low resolution of their arrays and flawed
methods, I believe JJ’s results should be much more
promising.
 Linkage disequilibrium between CNVs and SNPs
requires more sophisticated statistics and algorithms.
