Global Variation in Copy Number in the Human Genome
Nature, Genome Research, Genome Research, 2006.
Speaker: Yao-Ting Huang

References
Redon et al. Global variation in copy number in the human genome. Nature, 2006.
Fiegler et al. Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Research, 2006.
Komura et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Research, 2006.
Komura et al. Noise reduction from genotyping microarrays using probe level information. In Silico Biology, 2006.
Price et al. SW-Array: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. NAR, 2005.

Copy Number Variation (CNV)
A copy number variation (CNV) is a DNA segment at least 1 kb long that is present at a variable copy number compared with a reference genome.
CNVs are thought to arise from non-allelic homologous recombination.
CNVs may disrupt genes, alter gene dosage, and confer risk of complex diseases such as HIV-1 infection.

Examples of CNVs (1)
[Figure: two pedigrees. Deletion: paternal copy number = 2, maternal copy number = 2, offspring copy number = 1. Duplication: paternal copy number = 2, maternal copy number = 2, offspring copy number = 3.]

Examples of CNVs (2)
It is hard to tell the actual type of a CNV even within a family.
[Figure: paternal copy number = 3, maternal copy number = 3; one offspring has copy number = 2 and another has copy number = 4. Possible explanations: Mendelian inheritance, deletion, duplication.]

Use of Two Array Platforms
(1) Whole Genome TilePath (WGTP) array (93.7% of euchromatin);
(2) Affymetrix 500K SNP array.

Results
A total of 1,447 CNVs were identified and merged from the two arrays: 913 CNVs from the tiling array and 980 CNVs from the SNP genotyping array.
These CNVs cover 360 Mb (12%) of the human genome.
The mean CNV size is 341 kb on the tiling array and 206 kb on the SNP array.
The use of large-insert clones (~170 kb) on the tiling array tends to overestimate CNV size.

Strengths and Weaknesses of the Two Arrays
The 500K SNP array is better at detecting smaller CNVs.
The tiling array has more power than the SNP array in segmentally duplicated regions.

Location of CNVs
CNVs are preferentially located outside of genes and ultra-conserved elements.

Type of sequence                            WGTP CNVs    500K CNVs
RefSeq (~25,000 genes)                          2,561        1,139
OMIM (1,961 genes)                                251          112
Ultra-conserved elements (481 elements)            48           16
Conserved non-coding elements                 116,678       59,397

Other Results
48% of gaps in the human genome assembly are flanked or overlapped by CNVs.
24% of the 1,447 CNVs are associated with segmental duplications.
A portion of segmental duplications are themselves CNVs and are therefore not fixed in the human genome.
12% of the 1,447 CNVs were validated by a locus-specific quantitative assay (e.g., quantitative PCR).

Linkage Disequilibrium between Bi-allelic CNVs and Tag SNPs
Linkage disequilibrium between bi-allelic CNVs and flanking SNPs can guide the selection of tag SNPs.
For example, the copy number of CNV1 can be predicted from SNP2.
A single SNP array is then sufficient to detect both SNPs and CNVs.

Locus    Hap 1       Hap 2       Hap 3       Hap 4       Hap 5
SNP1     C           A           C           C           A
SNP2     A           C           A           C           A
CNV1     Copy # = 2  Copy # = 3  Copy # = 2  Copy # = 3  Copy # = 2
SNP3     C           G           C           G           C
SNP4     C           C           T           C           C

Linkage Disequilibrium between Bi-allelic CNVs and Tag SNPs (continued)
For example, suppose SNP2 is selected as the tag SNP.
[Figure: the same haplotype table with the tag SNP singled out: SNP2 = A corresponds to copy number = 2, and SNP2 = C corresponds to copy number = 3.]

Results of Linkage Disequilibrium around Bi-allelic CNVs
51% of CNVs in non-African populations have tag SNPs, whereas only 22% of CNVs in the African population can be tagged.
Duplications would generate linkage disequilibrium at the acceptor locus instead of the donor locus.
The Phase I HapMap project has a paucity of SNPs in segmentally duplicated regions, where CNVs are enriched.
Given the false-positive CNVs included and the uncertainty of CNV boundaries, these results are biased (Conrad et al., Nat. Genet., 2006).

Linkage Disequilibrium around Multi-allelic CNVs
Linkage disequilibrium between a multi-allelic CNV and each flanking SNP is computed as the square of Pearson's correlation coefficient.
No SNPs in strong linkage disequilibrium are found.
Comparing a bi-allelic SNP with a multi-allelic CNV in this way is a mistake.
[Figure: example haplotypes in which bi-allelic SNP1 (C or A) and SNP2 (C or G) flank multi-allelic CNV1 (copy number = 0, 1, 2, or 3).]

Lunch Break - Method
Intensity preprocessing
CNV detection
Copy number inference

Intensity Preprocessing
The signal intensity can be skewed by the length of the restriction enzyme fragment, the GC content of the probe sequence, the GC content of the restriction fragment, or affinity differences between SNP genotypes (e.g., AA, AC, CC).
Probe selection, noise reduction, and normalization are done at this stage (Komura et al., In Silico Biology, 2006).

CNV Detection
For each pair of samples, we can test the relative intensity ratio at each SNP position.
[Figure: log2 intensity ratio (y-axis, -2 to 2) versus SNP position (x-axis). Ratios above 0 indicate a relative gain of copy, ratios near 0 indicate no copy number change, and ratios below 0 indicate a relative loss of copy.]

CNV Detection
A CNV is detected by finding clusters of sufficiently high (or low) ratios.
[Figure: log2 intensity ratio versus SNP position.]

CNV Detection
The intensity ratios at all SNPs can be regarded as a sequence of real numbers.
We seek a consecutive subsequence of maximum sum.
Log2 intensity ratios by SNP position: 0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83

CNV Detection
A dynamic programming algorithm called SW-Array is used to find the subsequence (Price et al., NAR, 2005).
This algorithm was proposed by Bentley in 1984.
\[ S(i) = \max\{\, S(i-1) + P_i,\ 0 \,\} \]
Here P_1, P_2, P_3, P_4, P_5, … are the log2 intensity ratios in SNP order: 0, 0.54, 1.21, 0.26, 2.34, …, 0, 0.1, -1.43, -0.2, …, -2.4, -2.6, -1.83

Copy Number Inference
Each such cluster implies a putative CNV, but we still don't know the exact copy number.
[Figure: log2 intensity ratio (y-axis, -2 to 2) versus SNP position (x-axis).]

Pairwise Comparison for All Samples
The above algorithm is repeated for each pair of samples.
[Figure: intensity ratio of sample a / sample b.]

Copy Number Inference
The largest group of samples with the same copy number is called the diploid group.
This diploid group is used as a reference representing two copies.
The authors assume that mutation events are rare, so two copies should be the most frequent state in the population.

Steps of Copy Number Inference

Copy Number Inference
Samples c, d, and e form the largest group.

Copy Number Inference
The copy numbers of samples a and b are inferred by comparing their intensity ratios with the average ratio of the diploid group.

Concluding Remarks
The authors identify 1,447 CNVs using whole-genome tiling and SNP genotyping arrays.
Given the low resolution of their arrays and flawed methods, I believe JJ's results should be much more promising.
Linkage disequilibrium between CNVs and SNPs requires more sophisticated statistics and algorithms.
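
Appendix: sketch of the maximum-sum search
The search for a consecutive subsequence of maximum sum, described on the CNV Detection slides, can be sketched in Python as below. This is a minimal illustration of the recurrence S(i) = max{S(i-1) + P_i, 0}, not the SW-Array implementation itself: the function name max_sum_segment, the base case S(0) = 0, the toy ratios, and the trick of negating the ratios to find losses are all assumptions made for this example. Any thresholding or significance testing that SW-Array may apply is not given on the slides and is omitted.

```python
from typing import List, Tuple


def max_sum_segment(ratios: List[float]) -> Tuple[float, int, int]:
    """Find the consecutive run of log2 ratios with the largest sum.

    Returns (best_sum, start, end) with 0-based inclusive indices,
    or (0.0, -1, -1) when no run has a positive sum.
    """
    best_sum, best_start, best_end = 0.0, -1, -1
    s = 0.0      # S(i): best sum of a run ending at position i (assumed S(0) = 0)
    start = 0    # first position of the run currently being extended
    for i, p in enumerate(ratios):
        s = max(s + p, 0.0)          # the recurrence from the slide
        if s == 0.0:
            start = i + 1            # run was reset; next run begins after i
        elif s > best_sum:
            best_sum, best_start, best_end = s, start, i
    return best_sum, best_start, best_end


# Hypothetical toy ratios (not data from the paper).
toy = [0.1, -0.2, 0.5, 1.2, 0.3, -0.1, -1.4, -0.2]

gain = max_sum_segment(toy)                  # candidate relative gain
loss = max_sum_segment([-p for p in toy])    # candidate relative loss (assumed trick)
print("gain:", gain)
print("loss:", loss)
```

On the toy input the gain candidate covers positions 2 to 4 and the loss candidate covers positions 5 to 7. The scan takes linear time in the number of SNPs, which is what makes repeating it for every pair of samples feasible.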