Copy Number Variation (CNV)

Global Variation in Copy Number in the Human Genome
Nature and Genome Research, 2006
Speaker: Yao-Ting Huang

References
- Redon et al. Global variation in copy number in the human genome. Nature, 2006.
- Fiegler et al. Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Research, 2006.
- Komura et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Research, 2006.
- Komura et al. Noise reduction from genotyping microarrays using probe level information. In Silico Biology, 2006.
- Price et al. SW-Array: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. NAR, 2005.

Copy Number Variation (CNV)
- A copy number variation (CNV) is a DNA segment at least 1 kb long that is present at a variable copy number compared with a reference genome.
- CNVs are thought to arise from nonallelic homologous recombination.
- CNVs may disrupt genes, alter gene dosage, and confer risk for complex disease traits such as susceptibility to HIV-1 infection.

Examples of CNVs (1)
[Figure: paternal and maternal copy number = 2 each. One offspring has copy number = 1 (deletion); another has copy number = 3 (duplication).]

Examples of CNVs (2)
- It is hard to tell the actual type of a CNV, even within a family.
[Figure: paternal and maternal copy number = 3 each; offspring with copy number = 2 and copy number = 4. Possible explanations: Mendelian inheritance, deletion, duplication.]

Use of Two Array Platforms
- (1) Whole Genome TilePath (WGTP) array, covering 93.7% of euchromatin.
- (2) Affymetrix 500K SNP array.

Results
- A total of 1,447 CNVs were identified after merging the calls from the two arrays: 913 CNVs from the tiling array and 980 CNVs from the SNP genotyping array.
- These CNVs cover 360 Mb (12%) of the human genome.
- The mean CNV size is 341 kb on the tiling array and 206 kb on the SNP array.
- The large-insert clones (~170 kb) on the tiling array tend to overestimate the size of a CNV.

Strengths and Weaknesses of the Two Arrays
- The 500K SNP array is better at detecting smaller CNVs.
- The tiling array has more power than the SNP array in segmentally duplicated regions.

Location of CNVs
- CNVs are preferentially located outside of genes and ultra-conserved elements.

Types of Sequences                         WGTP CNVs   500K CNVs
RefSeq (~25,000 genes)                         2,561       1,139
OMIM (1,961 genes)                               251         112
Ultra-conserved elements (481 elements)           48          16
Conserved non-coding elements                116,678      59,397

Other Results
- 48% of the gaps in the human genome assembly are flanked or overlapped by CNVs.
- 24% of the 1,447 CNVs are associated with segmental duplications.
- Some segmental duplications are themselves CNVs and are therefore not fixed in the human genome.
- 12% of the 1,447 CNVs were validated by a locus-specific quantitative assay (e.g., quantitative PCR).

Linkage Disequilibrium between Bi-allelic CNVs and Tag SNPs
- Linkage disequilibrium between bi-allelic CNVs and flanking SNPs can guide the selection of tag SNPs.
- For example, the copy number of CNV1 can be predicted from SNP2.
- A single SNP array is then sufficient to detect both SNPs and CNVs.

SNP1   C   A   C   C   A
SNP2   A   C   A   C   A
CNV1   2   3   2   3   2   (copy number)
SNP3   C   G   C   G   C
SNP4   C   C   T   C   C

Linkage Disequilibrium between Bi-allelic CNVs and Tag SNPs (continued)
- For example, suppose SNP2 is selected as the tag SNP.
[Figure: the same five haplotypes with SNP2 highlighted; allele A at SNP2 goes with copy number = 2 and allele C with copy number = 3.]

Results of Linkage Disequilibrium around Bi-allelic CNVs
- 51% of CNVs in non-African populations have tag SNPs, whereas only 22% of CNVs in the African population can be tagged.
- Duplications would generate linkage disequilibrium at the acceptor locus instead of the donor locus.
- The Phase I HapMap project has a paucity of SNPs in segmentally duplicated regions, where CNVs are enriched.
- Given false-positive CNV calls and the uncertainty of CNV boundaries, these results are biased (Conrad et al., Nat. Genet., 2006).

Linkage Disequilibrium around Multi-allelic CNVs
- Linkage disequilibrium between a multi-allelic CNV and each flanking SNP is computed as the square of Pearson's correlation coefficient.
- No SNPs in strong linkage disequilibrium are found.
- It is a mistake to compare a bi-allelic SNP with a multi-allelic CNV in this way.
[Figure: SNP1 takes allele C or A and SNP2 takes allele C or G, while CNV1 takes copy number = 0, 1, 2, or 3.]

Lunch Break - Method
- Intensity preprocessing
- CNV detection
- Copy number inference

Intensity Preprocessing
- The signal intensity can be skewed by:
  - the length of the restriction enzyme fragment,
  - the GC content of the probe sequence,
  - the GC content of the restriction fragment, or
  - affinity differences between SNP genotypes (e.g., AA, AC, CC).
- Probe selection, noise reduction, and normalization are done at this stage (Komura et al., In Silico Biology, 2006).

CNV Detection
- For each pair of samples, we can test the relative intensity ratio at each SNP position.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis). Positive ratios mark a relative gain of copy, ratios near 0 mark no copy number change, and negative ratios mark a relative loss of copy.]

CNV Detection
- A CNV is detected by finding clusters of sufficiently high (or low) ratios.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis).]

CNV Detection
- The intensity ratios at all SNPs can be regarded as a sequence of real numbers.
- We seek a consecutive subsequence of maximum sum.
Log2 intensity ratio by SNP position: 0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83

CNV Detection
- A dynamic programming algorithm called SW-Array is used to find the subsequence (Price et al., NAR, 2005).
- This algorithm was described by Bentley in 1984.
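The maximum-sum search described on the CNV Detection slides can be sketched in Python as follows. This is a minimal illustration of the classic scan that keeps a running score and resets it to zero whenever it drops to zero or below; it is not the published SW-Array implementation, which may differ in details such as thresholding and significance testing. The function name `max_sum_segment` and the toy ratio track are assumptions made up for this sketch, not values from the papers.

```python
from typing import List, Tuple


def max_sum_segment(ratios: List[float]) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) for the contiguous run with maximum sum.

    Indices are 0-based and `end` is exclusive. If no value is positive,
    the empty segment (0.0, 0, 0) is returned.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    running, run_start = 0.0, 0
    for i, p in enumerate(ratios):
        running += p
        if running <= 0.0:
            # The running score cannot help any later segment: restart.
            running = 0.0
            run_start = i + 1
        elif running > best_sum:
            best_sum, best_start, best_end = running, run_start, i + 1
    return best_sum, best_start, best_end


# Hypothetical log2 intensity ratios along ten SNP positions (toy data).
toy_ratios = [0.0, 0.5, 1.0, 0.25, 2.0, 0.0, -1.5, -0.25, -2.5, -2.0]

gain = max_sum_segment(toy_ratios)                # candidate relative gain
loss = max_sum_segment([-r for r in toy_ratios])  # candidate relative loss
print("gain segment:", gain)
print("loss segment:", loss)
```

Negating the track before the second call reuses the same scan to look for a run of strongly negative ratios, that is, a candidate relative loss.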
$S(i) = \max\{\, S(i-1) + P_i,\ 0 \,\}$
P1, P2, P3, P4, P5, …: 0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83

Copy Number Inference
- Each such cluster implies a putative CNV.
- But we still do not know the exact copy number.
[Figure: log2 intensity ratio (y-axis, -2 to 2) against SNP position (x-axis).]

Pairwise Comparison for All Samples
- The above algorithm is repeated for each pair of samples.
[Figure: Sample a / Sample b.]

Copy Number Inference
- The largest group of samples with the same copy number is called the diploid group. This diploid group is used as the reference representing two copies.
- The authors assume that mutation events are rare, so two copies should be the most frequent state in the population.

Steps of Copy Number Inference
[Figure: steps of copy number inference.]

Copy Number Inference
- Samples c, d, and e form the largest group.

Copy Number Inference
- The copy numbers of samples a and b are inferred by comparing their intensity ratios with the average ratio of the diploid group.

Concluding Remarks
- The authors identify 1,447 CNVs using whole-genome tiling and SNP genotyping arrays.
- Given the low resolution of their arrays and flawed methods, I believe JJ's results should be much more promising.
- Linkage disequilibrium between CNVs and SNPs requires more sophisticated statistics and algorithms.
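A note on the Copy Number Inference slides: the diploid-group idea can be sketched in Python as below. This is only an illustration under loudly stated assumptions, not the authors' procedure. It assumes each sample already has a group label from the pairwise comparisons, that signal intensity inside the putative CNV is on a linear scale, and that intensity grows in proportion to copy number, so a sample's copy number is estimated as 2 times its intensity divided by the diploid-group mean. The function name `infer_copy_numbers`, the group labels, and the intensity values are all hypothetical.

```python
from collections import Counter
from typing import Dict


def infer_copy_numbers(intensity: Dict[str, float],
                       group: Dict[str, int]) -> Dict[str, int]:
    """Estimate copy numbers, using the largest group as the two-copy reference.

    intensity: sample -> linear-scale signal inside the putative CNV (assumed).
    group:     sample -> group label from the pairwise comparisons (assumed).
    """
    # The most common label defines the diploid group (two copies).
    diploid_label, _ = Counter(group.values()).most_common(1)[0]
    diploid = [s for s in intensity if group[s] == diploid_label]
    reference = sum(intensity[s] for s in diploid) / len(diploid)
    # Assumption: intensity is proportional to copy number.
    return {s: max(0, round(2 * intensity[s] / reference)) for s in intensity}


# Toy data: samples c, d, and e form the largest group.
toy_group = {"a": 1, "b": 2, "c": 0, "d": 0, "e": 0}
toy_intensity = {"a": 52.0, "b": 149.0, "c": 100.0, "d": 104.0, "e": 96.0}

print(infer_copy_numbers(toy_intensity, toy_group))
```

With these toy values, sample a comes out at about half the reference intensity and sample b at about one and a half times it, so they are assigned one and three copies.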
