Copy Number Analysis in the Cancer Genome Using SNP Arrays

Qunyuan Zhang, Aldi Kraja
Division of Statistical Genomics, Department of Genetics & Center for Genome Sciences
Washington University School of Medicine
Statistical Genetics Forum, 02-12-2007

What is Copy Number?

Gene Copy Number
The gene copy number (also "copy number variants" or CNVs) is the number of copies of a particular gene in the genotype of an individual. Recent evidence shows that the gene copy number can be elevated in cancer cells... (from Wikipedia, www.wikipedia.org)

DNA Copy Number
A Copy Number Variant (CNV) represents a copy number change involving a DNA fragment that is ~1 kilobase or larger. (from Nature Reviews Genetics, Feuk et al. 2006)

Chromosomal Copy Number
In most publications this refers to DNA copy number.

Why Study Copy Number?

“Chromosomal copy number alterations can lead to activation of oncogenes and inactivation of tumor suppressor genes (TSGs) in human cancers. … identification of cancer-specific copy number alterations will not only provide new insight into understanding the molecular basis of tumorigenesis but will also facilitate the discovery of new TSGs and oncogenes.”

DNA Copy Number Changes in Tumor Cells

[Figure: a normal cell (CN=2) and tumor cells with deletions and amplifications (copy number states shown: CN=0, 1, 2, 3, 4); listed between them: homologous repeats, segmental duplications, chromosomal rearrangements, duplicative transpositions, non-allelic recombinations, ……]

Why Use SNP Arrays?

CGH Array
CGH: comparative genomic hybridization
“Array-based CGH makes it possible to scan the genome for copy number with high resolution by hybridizing to arrayed genomic DNA or cDNA clones. … However, currently available array CGH methods cannot simultaneously detect chromosomal loss of heterozygosity (LOH).”
SNP Array
“… to combine the detection of cancer copy number with cancer-specific LOH in the same experiments, we have developed an analytical method to detect DNA copy number changes by hybridization of representations of genomic DNA to commercially available single nucleotide polymorphism (SNP) arrays.”
Simultaneously detect DNA copy number changes and genotype changes (LOH) in tumor cells

Materials & Methods

• 5 samples for validation, with known copy numbers of chromosome X (1, 2, 3, 4, 5 copies of chrom. X)
• 2 diploid cell lines containing cytogenetically mapped partial or whole-chromosome copy number gains or losses
• 18 lung and breast cancer cell lines
• 15 normal blood control cell lines

Affymetrix XbaI mapping array 130 (10,043 SNPs)
Chip scanning and image processing by MAS 5.0
Intensity normalization and summarization
Raw/observed copy numbers of cancer samples
Segmentation and copy number estimation (Hidden Markov Model, HMM)

Normalization & Summarization

Normalization (reducing technical variation between chips, making intensities from different chips comparable)
- Baseline array method

Summarization (combining the multiple probe intensities for each SNP to produce a summarized signal value for each SNP)
Perfect match: pm = pmA + pmB
Mismatch: mm = mmA + mmB
Model-based summarization: pm/mm difference multiplicative model (Li & Wong, 2001)

Observed/Raw Copy Number Data

For each SNP of each cancer sample:
Observed CN = (observed signal / mean signal of two-copy normal samples) × 2

[Figure: Log2 Transformed Intensities and Raw CNs. Black: Normal, Red: Tumor, Green: Tumor/Normal]

Segmentation & Estimation

[Figure: raw CN profile divided into segments labeled CN=4, CN=2, CN=3, CN=1]

CN Estimation: Hidden Markov Model (HMM)

CNAT (www.affymetrix.com); dChip (www.dchip.org); CNAG (www.genome.umin.jp) …

[Figure: HMM diagram over consecutive SNPs (SNP_i, SNP_i+1, SNP_i+2, SNP_i+3, SNP_i+4). Hidden status: unknown CN ("CN=?") at each SNP. Observed status: observed/raw CN ("Obs. CN") at each SNP.]
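The observed/raw copy number formula given above (observed signal divided by the mean signal of two-copy normal samples, times 2) can be illustrated with a minimal Python sketch. The function names and the toy signal values are hypothetical and are not taken from the slides; the signals are assumed to be already normalized and summarized per SNP.

```python
from math import log2
from statistics import mean


def raw_copy_number(tumor_signal, normal_signals):
    """Observed CN = (observed signal / mean signal of two-copy normals) * 2."""
    baseline = mean(normal_signals)
    if baseline <= 0:
        raise ValueError("mean normal signal must be positive")
    return 2.0 * tumor_signal / baseline


def raw_copy_numbers(tumor_profile, normal_profiles):
    """Apply the formula SNP by SNP.

    tumor_profile:   one summarized signal per SNP for a cancer sample
    normal_profiles: list of normal samples, each a list of per-SNP signals
    """
    per_snp_normals = zip(*normal_profiles)  # regroup normal signals by SNP
    return [raw_copy_number(t, n) for t, n in zip(tumor_profile, per_snp_normals)]


# Toy data (hypothetical): 3 SNPs, 1 tumor sample, 2 normal samples.
tumor = [1000.0, 1500.0, 480.0]
normals = [[900.0, 1000.0, 1000.0], [1100.0, 1000.0, 920.0]]

cns = raw_copy_numbers(tumor, normals)
# Log2 ratio relative to the two-copy state, as commonly plotted.
log_ratios = [log2(cn / 2.0) for cn in cns]
print(cns)
print(log_ratios)
```

For the toy data the three SNPs come out at raw CN 2.0, 3.0 and 1.0. These raw values are noisy in real data, which is why a segmentation step such as the HMM follows.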
CN estimation: finding the sequence of CN values that maximizes the likelihood of the observed raw CN.
Algorithm: Viterbi algorithm

The following information/assumptions are needed:
Background probabilities: overall probabilities of possible CN values.
P(CN=x); x=0,1,2,3,…,n (usually n<10)
Transition probabilities: probabilities of CN values of each SNP conditional on the previous one.
P(CN_i+1=x | CN_i=y); x=0,1,2,3,…,n; y=0,1,2,3,…,n
Emission probabilities: probabilities of observed raw CN values of each SNP conditional on the hidden/unknown/true CN status.
P(observed CN | CN=y); y=0,1,2,3,…,n

Prior Information for HMM

Background Probabilities (B)
Overall probabilities of possible CN values.
P(CN=2)=0.9
P(CN=i)=0.1/(N-1), i=0,1,3,4,…,N; N=max CN allowed
e.g. P(CN=i)=0.01 when N=11

Transition Probabilities (T)
Probabilities of CN values of each SNP conditional on the previous one.
P(CN_i+1=x | CN_i=y); x=0,1,2,3,…,n; y=0,1,2,3,…,n

       0     1     2     3     …     n
0     p00   p01   p02   p03    …    p0n
1     p10   p11   p12   p13    …    p1n
2     p20   p21   p22   p23    …    p2n
3     p30   p31   p32   p33    …    p3n
…
n     pn0   pn1   pn2   pn3    …    pnn

Based on genetic distance (Haldane map function)

Emission Probabilities (E)
Probabilities of observed raw CN values of each SNP conditional on the hidden/unknown/true CN status.
Signal | CN ~ t distribution with df=40

Max Likelihood (Observed CN | B, T, E); iterative

[Figure: HMM CN estimation for the samples with known CN of Chr. X]
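The HMM described above (background, transition and emission probabilities, decoded with the Viterbi algorithm) can be sketched in a few lines of Python. This is a simplified illustration, not the method of CNAT, dChip or CNAG. The following are assumptions made only for the sketch: transition probabilities are constant (p_stay) rather than a function of genetic distance, the background mass of 0.1 is spread evenly over the non-diploid states, and the emission model is a t distribution with df=40 centered on the hidden CN with a hypothetical scale of 0.3.

```python
from math import lgamma, log, pi


def log_t_pdf(x, df):
    """Log density of a standard Student's t distribution with df degrees of freedom."""
    return (lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
            - (df + 1) / 2 * log(1 + x * x / df))


def viterbi_copy_number(raw_cn, max_cn=4, p_normal=0.9, p_stay=0.99,
                        scale=0.3, df=40):
    """Return the most likely hidden CN sequence for a list of raw CN values."""
    if not raw_cn:
        return []
    states = range(max_cn + 1)
    n_other = max_cn  # number of states other than a given one

    # Background probabilities (B): P(CN=2)=0.9, remainder spread evenly (assumption).
    log_b = [log(p_normal) if s == 2 else log((1 - p_normal) / n_other)
             for s in states]

    # Transition probabilities (T): constant here; the slides use genetic distance.
    log_stay, log_move = log(p_stay), log((1 - p_stay) / n_other)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_move

    # Emission probabilities (E): raw CN | CN=s ~ s + scale * t(df) (assumption).
    def log_emit(obs, s):
        return log_t_pdf((obs - s) / scale, df) - log(scale)

    score = [log_b[s] + log_emit(raw_cn[0], s) for s in states]
    backpointers = []
    for obs in raw_cn[1:]:
        new_score, pointers = [], []
        for cur in states:
            # Best previous state for ending in `cur` at this SNP.
            prev = max(states, key=lambda p: score[p] + log_trans(p, cur))
            pointers.append(prev)
            new_score.append(score[prev] + log_trans(prev, cur) + log_emit(obs, cur))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


raw = [2.1, 1.9, 2.0, 3.1, 2.9, 3.2, 3.0, 2.0, 2.1, 1.1, 0.9, 1.0]
print(viterbi_copy_number(raw))
```

With a high p_stay, a single outlying SNP is not enough to change the inferred state; a run of consistent raw values is needed, which is the smoothing behavior the segmentation step relies on.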
Errors of HMM (1 - 99.2% = 0.8%)

“… our criteria for homozygous deletion require the presence of at least 2 SNPs that cover an area of 1 kb in addition to an inferred copy number of 0 …”

[Figure: HMM CN estimation for the samples with known loss or gain regions]

[Figure: HMM CN estimation for cancer cell lines]

Contamination Problem

Disadvantages of HMM
• No significance test
• Computationally intensive
• Individual-level analysis

Software
Affymetrix chips (www.affymetrix.com)
Illumina chips (www.illumina.com)
CNAT (www.affymetrix.com)
dChip (www.dchip.org)
CNAG (www.genome.umin.jp)
GenePattern (www.broad.mit.edu/cancer/software/genepattern/)
BioConductor R packages (www.bioconductor.org)
  GLAD package, adaptive weights smoothing (AWS) method
  DNAcopy package, circular binary segmentation method

References
• JL Freeman et al. Genome Research 2006; 16:949-961
• J Huang et al. Hum Genomics 2004; 1(4):287-99
• X Zhao et al. Cancer Research 2004; 64:3060-3071
• Y Nannya et al. Cancer Research 2005; 65:6071-6079
• … see google …

[Figure: Genome-wide Raw CN Changes (Pair #105)]

[Figure: Genome-wide Raw CN Changes (average over ~400 pairs)]

[Figure: Raw CN Changes of Chr. 14 (average over ~400 pairs)]

Sliding Window Analysis

[Figure: sliding windows along the SNP sequence: Window 1, Window 2, …, Window 10, …, Window k, …]
Each window k (k = 1, …, N) contains 30 consecutive SNPs (k, k+1, k+2, k+3, …, k+29)

[Figure: Genome-wide Raw Copy Number Changes (sliding window plot, averaged over ~400 pairs)]

Sliding Window Test of Significance of CN Changes
[Figure: -log(p) values, based on ~400 pairs]

CN Change Frequencies in Population (Chr. 14, ~400 pairs)
[Figure legend: Black: Freq.(CN>0); Red: Freq.(CN>0, significant amplification at 0.01 level); Green: Freq.(CN<0, significant deletion at 0.01 level)]

Microarray: From Image to Copy Number

Tumor and normal samples, Affymetrix Mapping 250K Sty-I chip
~250K probe sets, ~250K SNPs; probe set (24 probes)
Deletion: CN=0 or CN=1
Normal: CN=2
Amplification: CN>2
More DNA copy number → more DNA hybridization → higher intensity

General Procedures for Copy Number Analysis

Finished chips
  (scanner)
Raw image data [.DAT files] (experiment info [.EXP], chip description file [.CDF])
  Preprocessing (image processing software)
Probe-level raw intensity data [.CEL files]
  Background adjustment, normalization, summarization
Summarized intensity data
Raw copy number (CN) data [log ratio of tumor/normal intensities]
Significance test of CN changes
Estimation of CN
Smoothing and boundary determination
Concurrent regions among population
Amplification and deletion frequencies among populations
Association analysis
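The sliding window test of CN changes mentioned above (windows of 30 consecutive SNPs, -log(p) values computed across tumor/normal pairs) is not specified in detail on the slides. As one possible sketch, the Python code below averages the raw CN change within each window for every pair and tests whether the mean across pairs differs from zero. The choice of a two-sided z-test (a normal approximation that is reasonable for a few hundred pairs) and all function and variable names are assumptions made for illustration.

```python
from math import log10, sqrt
from statistics import NormalDist, mean, stdev


def sliding_window_test(cn_changes, window=30):
    """Test each sliding window for a non-zero mean CN change across pairs.

    cn_changes: list of tumor/normal pairs; each pair is a list with one raw
                CN change (e.g. log ratio of tumor/normal) per SNP.
    Returns a list of (window_start, mean_change, p_value) tuples.
    """
    n_pairs = len(cn_changes)
    n_snps = len(cn_changes[0])
    if n_pairs < 2:
        raise ValueError("need at least two pairs to estimate the variance")
    results = []
    for k in range(n_snps - window + 1):
        # One window average per pair: SNPs k, k+1, ..., k+window-1.
        per_pair = [mean(pair[k:k + window]) for pair in cn_changes]
        m = mean(per_pair)
        se = stdev(per_pair) / sqrt(n_pairs)
        if se == 0:
            p = 1.0 if m == 0 else 0.0
        else:
            z = m / se
            p = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value
        results.append((k, m, p))
    return results


def minus_log10_p(p, floor=1e-300):
    """-log10(p), with a floor so that p=0 does not raise an error."""
    return -log10(max(p, floor))


# Toy data (hypothetical): 4 pairs, 6 SNPs, gain in the last three SNPs.
pairs = [
    [0.1, -0.1, 0.0, 1.0, 1.1, 0.9],
    [-0.1, 0.1, 0.0, 0.9, 1.0, 1.1],
    [0.0, 0.1, -0.1, 1.1, 0.9, 1.0],
    [0.05, -0.05, 0.1, 1.0, 1.2, 0.9],
]
for start, change, p in sliding_window_test(pairs, window=3):
    print(start, round(change, 3), round(minus_log10_p(p), 2))
```

Unlike the HMM, this approach works at the population level and yields a p-value per window, at the cost of a fixed window size that limits boundary resolution.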