DNA Copy Number Analysis (SGF talk 2007-02-12)

Copy Number Analysis in the Cancer Genome
Using SNP Arrays
Qunyuan Zhang, Aldi Kraja
Division of Statistical Genomics
Department of Genetics & Center for Genome Sciences
Washington University School of Medicine
Statistical Genetics Forum
02 - 12 - 2007
What is Copy Number?
• Gene Copy Number
The gene copy number (also "copy number variants" or CNVs)
is the number of copies of a particular gene in the genotype of
an individual. Recent evidence shows that the gene copy
number can be elevated in cancer cells... (from Wikipedia,
www.wikipedia.org)
• DNA Copy Number
A Copy Number Variant (CNV) represents a copy number
change involving a DNA fragment that is ~1 kilobase or
larger. (from Nature Reviews Genetics, Feuk et al. 2006)
• Chromosomal Copy Number
- It refers to DNA Copy Number in most publications.
Why Study Copy Number?
“Chromosomal copy number alterations can lead to activation of
oncogenes and inactivation of tumor suppressor genes (TSGs) in
human cancers. … identification of cancer-specific copy number
alterations will not only provide new insight into understanding the
molecular basis of tumorigenesis but will also facilitate the
discovery of new TSGs and oncogenes.”
DNA Copy Number Changes in Tumor Cells
[Figure: a normal cell (CN=2) and tumor cells with deletion (CN=0, CN=1), unchanged regions (CN=2), and amplification (CN=3, CN=4).]
Mechanisms: homologous repeats, segmental duplications, chromosomal rearrangements, duplicative transpositions, non-allelic recombinations, ……
Why Use SNP Arrays?
• CGH Array
CGH: Comparative genomic hybridization
“Array-based CGH makes it possible to scan the genome for copy number with high
resolution by hybridizing to arrayed genomic DNA or cDNA clones.
… However, currently available array CGH methods cannot simultaneously
detect chromosomal loss of heterozygosity (LOH).”
• SNP Array
“… to combine the detection of cancer copy number with cancer-specific LOH in
the same experiments, we have developed an analytical method to
detect DNA copy number changes by hybridization of representations
of genomic DNA to commercially available single nucleotide polymorphism
(SNP) arrays.”
SNP arrays simultaneously detect DNA copy number changes and genotype changes (LOH)
in tumor cells.
Materials & Methods
5 samples for validation, with known copy numbers of chromosome X
(1,2,3,4,5 copies of chrom. X)
2 diploid cell lines containing cytogenetically mapped partial or whole-chromosome
copy number gains or losses.
18 lung and breast cancer cell lines
15 normal blood control cell lines
Affymetrix XbaI mapping array 130 (10,043 SNPs)
Chip scanning and image processing by MAS 5.0
Intensity normalization and summarization
Raw/observed copy numbers of cancer samples
Segmentation and copy number estimation (Hidden Markov Model, HMM)
Normalization & Summarization
• Normalization (reducing technical variation between chips, making intensities
from different chips comparable)
- Baseline array method
• Summarization (combining the multiple probe intensities for each SNP to
produce a summarized signal value for each SNP)
Perfect match: pm = pmA + pmB
Mismatch: mm = mmA + mmB
Model-based summarization:
pm/mm difference multiplicative model (Li & Wong, 2001)
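The multiplicative model of Li & Wong (2001) writes each pm - mm difference as (sample signal) x (probe sensitivity) + error. The following is a minimal sketch of fitting such a model by alternating least squares, assuming a complete matrix of differences and no outlier handling; the function name fit_li_wong and the toy numbers are hypothetical and are not taken from the talk or from dChip.

```python
import math

def fit_li_wong(diff, n_iter=50):
    """Fit diff[i][j] ~ theta[i] * phi[j] by alternating least squares.

    diff[i][j] is the pm - mm difference of sample i at probe j.
    phi is rescaled so that sum(phi[j]**2) equals the number of probes.
    Assumes the data are not all zero.
    """
    n_samples, n_probes = len(diff), len(diff[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_samples
    for _ in range(n_iter):
        # Update each sample signal given the probe sensitivities.
        denom = sum(p * p for p in phi)
        theta = [sum(row[j] * phi[j] for j in range(n_probes)) / denom
                 for row in diff]
        # Update each probe sensitivity given the sample signals.
        denom = sum(t * t for t in theta)
        phi = [sum(diff[i][j] * theta[i] for i in range(n_samples)) / denom
               for j in range(n_probes)]
        # Enforce the identifiability constraint sum(phi^2) = n_probes.
        scale = math.sqrt(n_probes / sum(p * p for p in phi))
        phi = [p * scale for p in phi]
    # Final signal update so that theta matches the rescaled phi.
    denom = sum(p * p for p in phi)
    theta = [sum(row[j] * phi[j] for j in range(n_probes)) / denom
             for row in diff]
    return theta, phi

# Toy data: three samples, three probes, exactly rank one (hypothetical values).
true_theta = [100.0, 200.0, 400.0]
true_phi = [0.5, 1.0, 1.5]
diff = [[t * p for p in true_phi] for t in true_theta]
theta, phi = fit_li_wong(diff)
```

The fitted theta values play the role of the summarized signal per SNP and sample; only their ratios between samples matter for the copy number step.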
Observed/Raw Copy Number Data
For each SNP of each cancer sample:
Observed CN = (observed signal / mean signal of two-copy normal samples) x 2
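As a small illustration of the ratio above, the sketch below computes the observed CN of one SNP from a tumor signal and a list of normal-sample signals. The function name and the numbers are hypothetical.

```python
def raw_copy_number(tumor_signal, normal_signals):
    """Observed CN = 2 * tumor signal / mean signal of two-copy normals."""
    mean_normal = sum(normal_signals) / len(normal_signals)
    return 2.0 * tumor_signal / mean_normal

# A tumor signal 1.5 times the normal mean corresponds to an observed CN of 3.
example_cn = raw_copy_number(150.0, [100.0, 100.0])
```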
[Figure: Log2 Transformed Intensities and Raw CNs. Black: Normal; Red: Tumor; Green: Tumor/Normal]
Segmentation & Estimation
[Figure: segments labeled CN=4, CN=2, CN=3, CN=1]
CN Estimation: Hidden Markov Model (HMM)
CNAT (www.affymetrix.com); dChip (www.dchip.org); CNAG (www.genome.umin.jp)
[Diagram: a chain of consecutive SNPs (…, SNP_i, SNP_i+1, SNP_i+2, SNP_i+3, SNP_i+4, …). Each SNP has a hidden status (unknown CN, "CN=?") and an observed status (observed/raw CN, "Obs. CN").]
CN estimation: finding the sequence of CN values that maximizes the likelihood of the observed raw CN.
Algorithm: Viterbi algorithm
The following information/assumptions are needed:
Background probabilities: overall probabilities of possible CN values.
P(CN=x); x=0,1,2,3,…,n (usually n<10)
Transition probabilities: probabilities of the CN value of each SNP conditional on the previous one.
P(CN_i+1=x|CN_i=y); x=0,1,2,3,…,n; y=0,1,2,3,…,n
Emission probabilities: probabilities of the observed raw CN value of each SNP conditional on the
hidden/unknown/true CN status.
P(observed CN | CN=y); y=0,1,2,3,…,n
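The three ingredients listed above are exactly what the Viterbi algorithm consumes. Below is a minimal sketch in log space, assuming a non-empty observation sequence. The toy probabilities and the Gaussian emission are hypothetical stand-ins chosen for readability; they are not the settings used by CNAT, dChip, or CNAG.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for the observations.

    log_start[s]    : log background probability of state s
    log_trans[y][x] : log P(next state = x | current state = y)
    log_emit(s, o)  : log density of observation o given state s
    """
    score = {s: log_start[s] + log_emit(s, obs[0]) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for x in states:
            # Best predecessor for ending in state x at this SNP.
            best_prev = max(states, key=lambda y: score[y] + log_trans[y][x])
            new_score[x] = score[best_prev] + log_trans[best_prev][x] + log_emit(x, o)
            pointer[x] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

# Toy example (all numbers hypothetical).
states = [1, 2, 3]
log_start = {1: math.log(0.05), 2: math.log(0.9), 3: math.log(0.05)}
log_trans = {y: {x: math.log(0.9 if x == y else 0.05) for x in states}
             for y in states}

def log_emit(cn, observed, sd=0.3):
    # Gaussian log density up to an additive constant (an assumption).
    return -((observed - cn) ** 2) / (2 * sd ** 2)

raw_cn = [2.1, 1.9, 2.0, 3.1, 2.9, 3.2]
path = viterbi(raw_cn, states, log_start, log_trans, log_emit)
print(path)  # [2, 2, 2, 3, 3, 3]
```

Working with log probabilities avoids numerical underflow when the chain covers thousands of SNPs.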
Prior Information for HMM
Background Probabilities (B)
Overall probabilities of possible CN values.
P(CN=2)=0.9
P(CN=i)=0.1/(N-1) for i≠2, i=0,1,3,4,…,N-1; N = number of allowed CN values (0 to N-1), so that the probabilities sum to 1. E.g. P(CN=i)=0.01 when N=11
Transition Probabilities (T)
Probabilities of CN values of each SNP conditional on the previous one.
P(CN_i+1=x|CN_i=y); x=0,1,2,3,…,n; y=0,1,2,3,…,n

       0     1     2     3    …    n
0     p00   p01   p02   p03   …   p0n
1     p10   p11   p12   p13   …   p1n
2     p20   p21   p22   p23   …   p2n
3     p30   p31   p32   p33   …   p3n
…
n     pn0   pn1   pn2   pn3   …   pnn
Genetic distance (Haldane map function)
Emission Probabilities (E)
Probabilities of observed raw CN values of each SNP conditional on the hidden/unknown/true CN
status.
Signal | CN ~ t distribution with df=40
Maximum likelihood (Observed CN | B, T, E); iterative
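The priors B, T, and E can be sketched as follows. Several choices here are assumptions rather than facts from the talk: N is read as the number of allowed CN states (so N=11 reproduces the 0.01 in the example), the way the Haldane recombination fraction enters the transition matrix is one plausible construction, and the location and scale of the t distribution are hypothetical parameters. All function names are illustrative.

```python
import math

def background_probs(n_states, p_normal=0.9):
    """P(CN=2) = p_normal; the rest is shared equally by the other states."""
    other = (1.0 - p_normal) / (n_states - 1)
    return [p_normal if cn == 2 else other for cn in range(n_states)]

def haldane_theta(distance_morgans):
    """Haldane map function: recombination fraction for a genetic distance."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))

def transition_matrix(distance_morgans, background):
    """Row y, column x holds P(CN_next = x | CN_current = y).

    ASSUMPTION: with probability 2*theta the next state is redrawn from the
    background distribution, otherwise the current state is kept.
    """
    redraw = 2.0 * haldane_theta(distance_morgans)
    n = len(background)
    return [[(1.0 - redraw) * (1.0 if x == y else 0.0) + redraw * background[x]
             for x in range(n)]
            for y in range(n)]

def log_t_pdf(x, df=40, loc=0.0, scale=1.0):
    """Log density of a location-scale Student t distribution."""
    z = (x - loc) / scale
    return (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2.0 * math.log1p(z * z / df))

background = background_probs(11)
transitions = transition_matrix(0.01, background)  # SNPs 1 cM apart
```

With nearby SNPs the matrix is close to the identity, so the estimated CN tends to stay constant along a segment; distant SNPs fall back toward the background probabilities.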
HMM CN estimation for the samples with known CN of Chr. X
Errors of HMM (1-99.2%=0.8%)
“… our criteria for homozygous deletion require the presence of at least 2 SNPs that cover an
area of 1 kb in addition to an inferred copy number of 0 …”
HMM CN estimation for the samples with known loss or gain regions
HMM CN estimation for cancer cell lines
Contamination Problem
Disadvantages of HMM
• No significance test
• Computationally intensive
• Individual-level analysis
Software
Affymetrix Chips (www.affymetrix.com)
Illumina Chips (www.illumina.com)
CNAT (www.affymetrix.com)
dChip (www.dchip.org)
CNAG (www.genome.umin.jp)
GenePattern (www.broad.mit.edu/cancer/software/genepattern/)
BioConductor R Packages (www.bioconductor.org)
- GLAD package, adaptive weights smoothing (AWS) method
- DNAcopy package, circular binary segmentation method
References
• JL Freeman et al. Genome Research 2006; 16:949-961
• J Huang et al. Hum Genomics 2004; 1(4):287-299
• X Zhao et al. Cancer Research 2004; 64:3060-3071
• Y Nannya et al. Cancer Research 2005; 65:6071-6079
• … see google …
Genome-wide Raw CN Changes (Pair #105)
Genome-wide Raw CN Changes
(average over ~400 pairs )
Raw CN Changes of Chr. 14
(average over ~400 pairs )
Sliding Window Analysis
[Diagram: SNPs along the genome covered by overlapping windows: Window 1, Window 2, …, Window 10, …, Window k, …, Window N]
Each window (k) contains 30 consecutive SNPs (k, k+1, k+2, k+3, …, k+29)
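A minimal sketch of the window definition above: window k averages the values of SNPs k through k+29, and consecutive windows are shifted by one SNP. The function name is hypothetical, and the input is assumed to be a list of raw CN changes ordered by position.

```python
def sliding_window_means(values, window=30):
    """Mean of every run of `window` consecutive values (step of one SNP)."""
    n_windows = len(values) - window + 1
    if n_windows <= 0:
        return []
    total = sum(values[:window])
    means = [total / window]
    for k in range(1, n_windows):
        # Slide by one SNP: add the entering value, drop the leaving one.
        total += values[k + window - 1] - values[k - 1]
        means.append(total / window)
    return means

# Small example with a window of 3 instead of 30.
example_means = sliding_window_means([1, 2, 3, 4, 5], window=3)
```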
Genome-wide Raw Copy Number Changes
(sliding window plot, averaged over ~400 pairs )
Sliding Window Test of Significance of CN Changes
-log(p) values, based on ~ 400 pairs
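The slides do not say which test produces the p values. As one possible reading, the sketch below averages each pair's raw CN change within a window and tests whether the mean across pairs differs from zero, using a normal approximation to the one-sample t statistic (reasonable with ~400 pairs). The choice of test, the base-10 logarithm, and the function name are all assumptions.

```python
import math
from statistics import mean, stdev

def window_neg_log10_p(changes, window=30):
    """changes[p][s] is the raw CN change of pair p at SNP s.

    Returns -log10(p) for each sliding window of `window` consecutive SNPs.
    """
    n_pairs = len(changes)
    n_snps = len(changes[0])
    scores = []
    for k in range(n_snps - window + 1):
        # One summary value per tumor/normal pair for this window.
        per_pair = [mean(row[k:k + window]) for row in changes]
        se = stdev(per_pair) / math.sqrt(n_pairs)
        if se == 0.0:
            scores.append(0.0 if mean(per_pair) == 0.0 else float("inf"))
            continue
        z = mean(per_pair) / se
        # Two-sided p value under the normal approximation.
        p = math.erfc(abs(z) / math.sqrt(2.0))
        scores.append(-math.log10(max(p, 1e-300)))
    return scores

# Toy data: 4 pairs, 4 SNPs; the first two SNPs are gained in every pair.
toy_changes = [
    [1.0, 1.1, 0.1, -0.1],
    [0.9, 1.0, -0.1, 0.1],
    [1.1, 0.9, 0.1, -0.2],
    [1.0, 1.2, -0.2, 0.2],
]
toy_scores = window_neg_log10_p(toy_changes, window=2)
```

Unlike the HMM, this gives a population-level significance value per window rather than a CN call per individual.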
CN Change Frequencies in Population (Chr. 14, ~400 pairs)
Black: Freq.(CN>0)
Red: Freq.(CN>0, significant amplification at 0.01 level)
Green: Freq.(CN<0, significant deletion at 0.01 level)
• Microarray: From Image to Copy Number
[Figure: tumor and normal samples on the Affymetrix Mapping 250K Sty-I chip (~250K probe sets, ~250K SNPs; 24 probes per probe set). Regions are labeled CN=0 and CN=1 (deletion), CN=2, and CN>2 (amplification).]
More DNA copy number → more DNA hybridization → higher intensity
• General Procedures for Copy Number Analysis
Finished chips
→ (scanner) Raw image data [.DAT files] (experiment info [.EXP])
→ (image processing software) Probe-level raw intensity data [.CEL files] (chip description file [.CDF])
→ Preprocessing: background adjustment, normalization, summarization
→ Summarized intensity data
→ Raw copy number (CN) data [log ratio of tumor/normal intensities]
→ Significance test of CN changes; estimation of CN; smoothing and boundary determination
→ Concurrent regions among population; amplification and deletion frequencies among populations; association analysis
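The raw CN step of this pipeline can be sketched as a log ratio of summarized tumor and normal intensities. Base 2 is assumed here (the outline only says "log ratio"), and the function names are hypothetical.

```python
import math

def log2_ratio(tumor_intensity, normal_intensity):
    """Raw CN on the log scale: 0 means no change relative to normal."""
    return math.log2(tumor_intensity / normal_intensity)

def raw_cn_from_log_ratio(log_ratio, normal_cn=2.0):
    """Convert a log2 ratio back to a raw copy number."""
    return normal_cn * 2.0 ** log_ratio

# A doubling of intensity gives a log2 ratio of 1, i.e. a raw CN of 4.
doubled = raw_cn_from_log_ratio(log2_ratio(200.0, 100.0))
```

On the log scale gains are positive and losses negative, which is the convention used by the CN>0 and CN<0 frequency plots.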