Modelling of CGH array experiments
• Philippe Broët
Faculté de Médecine,
Université de Paris-XI
• Sylvia Richardson
Imperial College
London
CGH = Comparative Genomic Hybridization
1
Outline
• Background
• Mixture model with spatial allocations
• Performance, comparison with CGHMiner
• Analyses of CGH-array cancer data sets
• Extensions
2
Aim: study genomic alterations in oncology
Loss: tumor suppressor gene
Gain: oncogene
The development of solid tumors is associated with the
acquisition of complex genetic alterations that modify normal
cell growth and survival.
Many of these changes involve gains and/or losses of parts of
the genome: amplification of an oncogene and deletion of a
tumor suppressor gene are considered important
mechanisms of tumorigenesis.
3
CGH = Comparative Genomic Hybridization
• Array containing short sequences of DNA bound to
glass slide
• Fluorescein-labeled normal and pathologic samples
co-hybridised to the array
Case and control samples:
1. DNA extraction
2. Labelling (fluorescence)
3. Co-hybridization
4. Scanning
4
• Once hybridization has been performed, the signal
intensities of the fluorophores are quantified
• This provides a means to quantitatively measure
DNA copy-number alterations and to map them
directly onto the genomic sequence
5
MCF7 cell line investigated in Pollack et al (2002)
23 chromosomes and 6691 cDNA sequences
Data log transformed: difference between MCF7 and reference
6
Types of alterations observed
• (Single) gain or deletion of sequences,
occurring for contiguous regions
Low-level changes in the ratio, ± log 2,
but attenuation (dye bias) => ratio ≈ ± 0.4
• Multiple gains (small regions)
High level change, easy to pick up
Focus the modelling on the first
common type of alterations
7
[Figure: chromosome 1, with regions annotated "Deletion?", "Multiple gains?" and "Normal?"]
8
2 -- Mixture model
9
Specificity of CGH array experiment
A priori biological knowledge from conventional CGH :
• Limited number of states for a genomic sequence :
- presence (modal), - deletion, - gain(s)
corresponding to different intensity ratios on the array
Mixture model to capture the underlying discrete states
• Genomic sequences (GS) located contiguously on chromosomes are likely to
carry alterations of the same type
Use clone spatial location in the allocation model
3 component mixture model with spatial allocation
10
Mixture model
For chromosome k:
Zgk : log ratio of measurement of normal versus tumoral change,
genomic sequence (GS) g, chromosome k
Dye bias is estimated by using a reference array (normal/normal)
and then subtracting the bias from Zgk
$Z_{gk} \sim w_{1gk} N(\mu_1, \sigma_1^2) + w_{2gk} N(\mu_2, \sigma_2^2) + w_{3gk} N(\mu_3, \sigma_3^2)$
1=deletion
2=presence
3=gain
For unique labelling:
μ1 < 0 , μ3 > 0
μ2 = 0 (dye bias has been adjusted)
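A minimal sketch of how the three-component density above could be evaluated for a single log ratio. The numerical values of the weights, means and standard deviations are illustrative assumptions, not estimates from the talk, and the function names are hypothetical.

```python
import math

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(z, weights, means, sds):
    """Density of a finite normal mixture at z; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * normal_pdf(z, m, s) for w, m, s in zip(weights, means, sds))

# Illustrative values only (assumptions, not results from the slides).
MEANS = (-0.4, 0.0, 0.4)    # mu_1 < 0 (deletion), mu_2 = 0 (presence), mu_3 > 0 (gain)
SDS = (0.3, 0.3, 0.3)
WEIGHTS = (0.2, 0.6, 0.2)   # w_1gk, w_2gk, w_3gk for one GS

density_at_zero = mixture_density(0.0, WEIGHTS, MEANS, SDS)
```

In the model the weights differ from one GS to the next, so `WEIGHTS` would be replaced by the weights of the GS under consideration.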
11
Mixture model with spatial allocation
$Z_{gk} \sim w_{1gk} N(\mu_1, \sigma_1^2) + w_{2gk} N(\mu_2, \sigma_2^2) + w_{3gk} N(\mu_3, \sigma_3^2)$
Spatial structure on the weights (c.f. Fernandez and Green, 2002):
• Introduce 3 centred Markov random fields $\{u_{mgk}\}$, m = 1, 2, 3
with nearest neighbours along the chromosomes
Spatial neighbours of GS g
x
x x
g -1 g g+1
• Define mixture proportions to depend on the chromosomic
location via a logistic model:
$w_{cgk} = \exp(u_{cgk}) / \sum_m \exp(u_{mgk})$
favours allocation of nearby GS to same component
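The logistic link can be sketched as follows: the three field values at one GS are mapped to three weights that sum to one. The field values used here are hypothetical and the function name is an assumption made for illustration.

```python
import math

def allocation_weights(u):
    """Map the field values (u_1, u_2, u_3) at one GS to mixture weights."""
    shift = max(u)  # subtracting the maximum avoids overflow and leaves the ratios unchanged
    exp_u = [math.exp(x - shift) for x in u]
    total = sum(exp_u)
    return [e / total for e in exp_u]

# Hypothetical field values for three neighbouring GS on one chromosome.
fields = [(1.5, 0.0, -1.5), (1.2, 0.1, -1.3), (0.0, 0.0, 0.0)]
weights = [allocation_weights(u) for u in fields]
```

When the fields vary smoothly along the chromosome, as for the first two GS here, neighbouring GS receive similar weights.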
12
Prior structure
• $w_{cgk} = \exp(u_{cgk}) / \sum_m \exp(u_{mgk})$
with Gaussian Conditional AutoRegressive (CAR) model:
$u_{cgk} \mid u_{c,-g,k} \sim N\left(\sum_h u_{chk} / n_g,\; s_{ck}^2 / n_g\right)$
for h = neighbour of g ($n_g$ = #h, one or two in this simple
case), with constraint $\sum_g u_{cgk} = 0$
• Variance parameters $s_{ck}^2$ of the CAR act as a
smoothing prior; they are indexed by the chromosome, so the
'switching structure' between the states can be
different between chromosomes
• Means and variances $(\mu_c, \sigma_c^2)$ of the mixture components
are common to all chromosomes => borrowing information
• Inverse gamma priors for the variances, uniform priors for
the means
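A small sketch of the CAR full conditional for one component on one chromosome, with nearest neighbours along a line: the conditional mean is the average of the neighbouring field values and the conditional variance is the CAR variance divided by the number of neighbours. Function names and the example values are assumptions for illustration only.

```python
def car_conditional(u, g, s2):
    """Conditional mean and variance of u[g] given the other field values.

    u  : field values along one chromosome for one component
    g  : index of the GS of interest
    s2 : CAR variance parameter for this component and chromosome
    """
    neighbours = [h for h in (g - 1, g + 1) if 0 <= h < len(u)]
    n_g = len(neighbours)  # one at the chromosome ends, two elsewhere
    if n_g == 0:
        raise ValueError("a GS needs at least one neighbour")
    mean = sum(u[h] for h in neighbours) / n_g
    return mean, s2 / n_g

def centre(u):
    """Impose the sum-to-zero constraint by subtracting the average."""
    avg = sum(u) / len(u)
    return [x - avg for x in u]

field = centre([1.0, 0.0, 3.0, 2.0])
interior = car_conditional(field, 1, 0.5)   # two neighbours
boundary = car_conditional(field, 0, 0.5)   # one neighbour
```

A smaller `s2` pulls each value more tightly towards its neighbours, which is the smoothing effect mentioned above.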
13
Posterior quantities of interest
• Bayesian inference via MCMC, implemented using WinBUGS
• In particular, latent allocations, Lgk , of GS g on chromosome
k to state c, are sampled during the MCMC run
• Compute posterior allocation probabilities :
$p_{cgk} = P(L_{gk} = c \mid \text{data})$, c = 1, 2, 3
• Probabilistic classification of each GS using threshold
on pcgk :
-- Assign g to modified state: deletion (c=1) or gain
(c=3) if corresponding pcgk > 0.8,
-- Otherwise allocate to modal state.
Subset S of genomic sequences classified as modified
(this subset depends on the chosen threshold)
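The classification rule can be sketched as below, assuming the posterior allocation probabilities are estimated by the fraction of MCMC draws in which a GS is allocated to each state. The function names and the toy draws are hypothetical.

```python
from collections import Counter

def allocation_probabilities(draws):
    """Estimate (p1, p2, p3) from sampled labels L_gk in {1, 2, 3} for one GS."""
    counts = Counter(draws)
    n = len(draws)
    return tuple(counts[c] / n for c in (1, 2, 3))

def classify(probs, threshold=0.8):
    """Return 1 (deletion) or 3 (gain) if that probability exceeds the threshold, else 2 (modal)."""
    p_deletion, _, p_gain = probs
    if p_deletion > threshold:
        return 1
    if p_gain > threshold:
        return 3
    return 2

# Toy MCMC output for one GS: allocated to deletion in 9 draws out of 10.
probs = allocation_probabilities([1, 1, 1, 2, 1, 1, 1, 1, 1, 1])
state = classify(probs)
```

With a threshold above 0.5 the two conditions cannot both hold, so the order of the checks does not matter.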
14
False Discovery Rate
• Using the posterior allocation
probabilities, can compute an estimate of
FDR for the list S :
• Bayes FDR(S) | data = $\frac{1}{\mathrm{card}(S)} \sum_{g \in S} p_{2gk}$
where p2gk is posterior probability of allocation to
the modal (c=2) state
Note: Can adjust the threshold to get a desired
FDR and vice versa
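A minimal sketch of the Bayes FDR estimate: select the GS whose deletion or gain probability exceeds the threshold, then average their posterior probabilities of being modal. The probabilities below are made up for illustration, and returning 0.0 for an empty list is a convention chosen here, not something stated in the talk.

```python
def select_modified(probs, threshold=0.8):
    """Indices of GS classified as modified (deletion or gain)."""
    return [g for g, (p1, _, p3) in enumerate(probs) if p1 > threshold or p3 > threshold]

def bayes_fdr(probs, selected):
    """Average posterior probability of the modal state over the selected list S."""
    if not selected:
        return 0.0  # convention for an empty list (assumption)
    return sum(probs[g][1] for g in selected) / len(selected)

# Made-up posterior allocation probabilities (p1, p2, p3) for three GS.
probs = [(0.90, 0.10, 0.00), (0.10, 0.85, 0.05), (0.02, 0.08, 0.90)]
S = select_modified(probs)
fdr = bayes_fdr(probs, S)
```

Raising the threshold shrinks S to GS with smaller modal probabilities, which lowers the estimated FDR.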
15
3 -- Performance
16
Simulation set-up
• 200 simulated GS with
$Z \sim N(0, 0.3^2)$, modal
$Z \sim N(-\log 2, 0.3^2)$, deletion, a block of 30 GS
$Z \sim N(\log 2, 0.3^2)$, gains, blocks of 20 and 10 GS
• Reference array with $Z \sim N(0, 0.3^2)$
• 50 replications
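The simulation design can be sketched as follows. The block start positions and the random seed are hypothetical, and the mean of each block is passed in by the caller; here the deletion block is given a negative mean, matching mu_1 < 0 in the model.

```python
import math
import random

def simulate_profile(n_gs, blocks, sd, rng):
    """Return n_gs log ratios: N(0, sd^2) outside blocks, N(mean, sd^2) inside each block.

    blocks is a list of (start, length, mean) tuples.
    """
    means = [0.0] * n_gs
    for start, length, mean in blocks:
        for g in range(start, start + length):
            means[g] = mean
    return [rng.gauss(m, sd) for m in means]

rng = random.Random(2024)   # arbitrary seed
SHIFT = math.log(2)
# Hypothetical positions: one deletion block of 30 GS, gain blocks of 20 and 10 GS.
BLOCKS = [(40, 30, -SHIFT), (110, 20, SHIFT), (160, 10, SHIFT)]

replications = [simulate_profile(200, BLOCKS, 0.3, rng) for _ in range(50)]
reference = simulate_profile(200, [], 0.3, rng)   # normal/normal array
```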
[Figure: layout of the simulated GS: modal, deletion (30 GS), modal, gain (20 GS), modal, gain (10 GS), modal]
17
CGH-Miner
• Data mining approach to select gains and losses
(Wang et al 2005):
– Hierarchical clustering with a spatial constraint
(i.e. only spatially adjacent clusters are joined)
– Subtree selection according to predefined rules
=> focus on selecting large consistent gain/loss
regions and small (big spike) regions
– Implemented in the CGH-Miner Excel plug-in
• Estimation of FDR using a reference
(normal/normal) array and the same set of rules
to prune the tree. Declared target 1%
• Simulation set-up is similar to Wang et al.
18
Classification obtained by CGH-Miner and CGHmix
[Figure: classification over the simulated layout: modal, deletion (30 GS), modal, gain (20 GS), modal, gain (10 GS), modal]
19
Posterior probabilities of allocation to the 3 components
20
Comparative performance between CGHmix and CGH-Miner (50 simulations)

                                 CGHmix     CGH-Miner
Realised false positive (mean)   1.9        16.4
Realised false positive (range)  0 -- 20    3 -- 39
Realised false negative (mean)   1.0        9.6
Realised false negative (range)  0 -- 4     0 -- 50
Realised FDR (%)                 2.8        23.7
Estimated FDR (%)                1.3        1.2
21
4 -- Analyses of CGH-array cancer data sets
22
Breast cancer cell line MCF7
• Data from Pollack et al., 6691 GS on 23
chromosomes
• $\hat{\mu}_1 = -0.35$, $\hat{\sigma}_1 = 0.37$
• $\mu_2 = 0$, $\hat{\sigma}_2 = 0.27$
• $\hat{\mu}_3 = 0.44$, $\hat{\sigma}_3 = 0.54$
• Estimated FDR CGHmix = 2.6%
• Estimated FDR CGH-Miner = 1.5%
23
24
Classification of GS obtained by CGHmix
25
[Figure annotations: known alterations found by both methods; additional known alterations found by CGHmix]
26
Neuroblastoma KCNR cell line
Curie Institute CGH custom array
for chromosome 1
• 190 genomic clones, mostly on the short arm
• 3 replicate spots for each
• $\hat{\mu}_1 = -0.49$, loss component
• $\hat{\mu}_3 = 0.04$, not plausible => no gain in this case
• Estimate FDR by regrouping c=2 and c=3
classes
• Substantial number of deletions on short arm
• No deletion found for the long arm by CGHmix,
a result confirmed by classical cytogenetic
information
27
Long arm
28
Extensions
• Account for variability in the case of repeated
measurement
=> add a measurement model with GS-specific
noise, with exchangeable prior
• Refine the spatial model:
– Incorporate genomic sequence location in the
neighbourhood definition of the CAR model:
0-1 contiguity => spatial weights
– In particular, account for overlapping sequences
by using weights that depend on the overlap
29