Computational Systems Biology of Cancer (II):
Human Cancer Genome Project
Bud Mishra
Professor of Computer Science, Mathematics and Cell Biology
Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine
Genome Evolution: The New Synthesis
[Figure: DNA, RNA (transcription), protein (translation), and selection]
Part-lists, Annotation, Ontologies
Genetic instability and perturbed pathways:
• micro-environment
• proteomic
• epigenomics
• metabolomics
• transcriptomics
• signaling
Is the Genomic View of Cancer Necessarily Accurate?
• “If I said yes, that would then
suggest that that might be the only
place where it might be done which
would not be accurate, necessarily
accurate. It might also not be
inaccurate, but I'm disinclined to
mislead anyone.”
– US Secretary of Defense, Mr. Donald Rumsfeld, once again quoted completely out of context.
Cancer Initiation and Progression
Genomics (Mutations, Translocations, Amplifications, Deletions)
Epigenomics (Hyper & Hypo-Methylation)
Transcriptomics (Alternate Splicing, mRNA)
Proteomics (Synthesis, Post-Translational Modification, Degradation)
Signaling
Cancer Initiation and Progression
Proliferation, Motility,
Immortality,
Metastasis, Signaling
Mishra’s Mystical 3M’s
Measure, Mine, Model
• Rapid and accurate solutions
– Bioinformatic, statistical, systems, and computational approaches.
– Approaches that are scalable, agnostic to technologies, and widely applicable.
• Promises, challenges and obstacles
“Measure”
What we can quantify and what we cannot
Microarray Analysis of Cancer Genome
[Figure: Normal DNA and Tumor DNA; Normal LCR and Tumor LCR; Label; Hybridize]
• Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity.
– We array probes derived from low-complexity representations of the normal genome.
– We measure differences in gene copy number between samples ratiometrically.
– Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific signal and noise.
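A minimal sketch of a ratiometric copy-number readout, assuming per-probe tumor and normal intensities are already available as plain lists. The function name `log2_ratios` and the `floor` guard are illustrative choices, not part of the slides.

```python
import math

def log2_ratios(tumor, normal, floor=1e-9):
    """Per-probe log2(tumor / normal) intensity ratios."""
    if len(tumor) != len(normal):
        raise ValueError("tumor and normal must have one value per probe")
    # The floor only guards against log(0) on dead probes (an assumption).
    return [math.log2(max(t, floor) / max(n, floor))
            for t, n in zip(tumor, normal)]

tumor_intensity = [200.0, 400.0, 100.0]
normal_intensity = [200.0, 200.0, 200.0]
ratios = log2_ratios(tumor_intensity, normal_intensity)
```

A ratio of 0 suggests no change relative to normal, positive values suggest a gain, and negative values suggest a loss.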
Minimizing Cross Hybridization
(Complexity Reduction)
Copy Number Fluctuation
[Figure: panels A1, B1, C1, A2, B2, C2, A3, B3, C3]
Critical Innovations
• Data Normalization and Background Correction for Affy-Chips
– 10K, 100K, 500K (Affy); Generalized RMA
– Multi-Experiment-Based Probe-Characterization (Kalman + EM)
• A novel genome segmenter algorithm
– Empirical Bayes Approach; Maximum A Posteriori (MAP)
– Generative Model (Hierarchical, Heteroskedastic)
– Dynamic Programming Solution
• Cubic-Time; Linear-time Approximation using Beam-Search Heuristic
• Single Molecule Technologies
– Optical and Nanotechnologies
– Sequencing: SMASH
– Epigenomics
– Transcriptomics
Background Correction & Normalization
Oligo Arrays: SNP genotyping
• Given 500K human SNPs to be measured, select 10 25-mers that overlap each SNP location for Allele A.
– Select another 10 25-mers corresponding to SNP Allele B.
– Problem: Cross Hybridization
[Figure: DNA strand with overlapping 25-mers]
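The probe selection described above can be sketched as follows. This is only an illustration under stated assumptions: `tiling_probes` is a hypothetical helper, probe start offsets are spread evenly over the window that still covers the SNP, and the strand, thermodynamics, and cross-hybridization screening that a real design would need are ignored.

```python
def tiling_probes(reference, snp_pos, allele, probe_len=25, n_probes=10):
    """Return n_probes probe_len-mers that all cover snp_pos, carrying `allele` there."""
    seq = reference[:snp_pos] + allele + reference[snp_pos + 1:]
    # A probe covers the SNP if it starts in [snp_pos - probe_len + 1, snp_pos].
    first = max(0, snp_pos - probe_len + 1)
    last = min(snp_pos, len(seq) - probe_len)
    starts = list(range(first, last + 1))
    if len(starts) < n_probes:
        raise ValueError("not enough sequence context around the SNP")
    # Spread the chosen start offsets evenly over the allowed window.
    step = (len(starts) - 1) / (n_probes - 1)
    chosen = [starts[round(i * step)] for i in range(n_probes)]
    return [seq[s:s + probe_len] for s in chosen]

reference = "C" * 60                                  # toy reference sequence
probes_a = tiling_probes(reference, 30, "T")          # allele A probes
probes_b = tiling_probes(reference, 30, "G")          # allele B probes
```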
Using SNP arrays to detect Genomic Aberrations
• Each SNP “probeset” measures absence/presence of one of two alleles.
• If a region of DNA is deleted by cancer, one or both alleles will be missing!
• If a region of DNA is duplicated/amplified by cancer, one or both alleles will be amplified.
• Problem: Oligo arrays are noisy.
[Figures: Allele B vs. Allele A intensities for 90 humans at 1 SNP; three panels with A=0.48, A=0.24, and A=0.96]
Background Correction & Normalization
• Consider a genomic location $L$ and two “similar” nucleotide sequences $s_{L,x}$ and $s_{L,y}$ starting at that location in the two copies of a diploid genome.
– E.g., they may differ in one SNP.
– Let $q_x$ and $q_y$ be their respective copy numbers in the whole genome, where all copies are selected in the reduced-complexity representation. The gene chip contains four probes: $p_x \in s_{L,x}$; $p_y \in s_{L,y}$; $p_x', p_y' \notin G$.
– After PCR amplification, we have some $K_x \cdot q_x$ amount of DNA that is complementary to the probe $p_x$, etc., and $K'$ ($\approx K'_x$) amount of DNA that is additionally approximately complementary to the probe $p_x$.
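Normalize using a Generalized RMA
$$I' = U - \mu_n - \left[\alpha\sigma_n^2 - \frac{\phi_{N(0,1)}(a'/b')}{\Phi_{N(0,1)}(a'/b')}\right] \times \left\{1 + \frac{b' B_{\sigma_n}}{\Phi_{N(0,1)}(a'/b')}\right\}^{-1} + \left[\frac{b_{\sigma_n}}{B_{\sigma_n}}\right] \times \left\{1 + \frac{\Phi_{N(0,1)}(a'/b')}{b' B_{\sigma_n}}\right\}^{-1},$$
– where $a' = U - \mu_n - \alpha\sigma_n^2$; $b' = \sigma_n$, and
– $b_{\sigma_n} = \sum [I_{i,j} - U + \mu_n]\,\phi_{N(0,1)}([I_{i,j} - U + \mu_n])$
– $B_{\sigma_n} = \sum \phi_{N(0,1)}([I_{i,j} - U + \mu_n])$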
Background Correction & Normalization
• If the probe has an affinity $f_x$, then the measured intensity can be expressed as
$$[K_x q_x + K'] f_x + \text{noise} = [q_x + K'/K_x] f'_x + \text{noise}$$
– with $\exp[\mu_1 + \varepsilon_1\sigma_1]$ a multiplicative log-normal noise, $[\mu_2 + \varepsilon_2\sigma_2]$ an additive Gaussian noise, and $f'_x = K_x f_x$ an amplified affinity.
• A more general model:
$$I_x = [q_x + K'/K_x]\, f'_x\, e^{\mu_1 + \varepsilon_1\sigma_1} + \mu_2 + \varepsilon_2\sigma_2$$
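Mathematical Model
• In particular, we have four values of measured intensities:
$$I_x = [q_x f'_x + N_x]\, e^{\mu_1 + \varepsilon_1\sigma_1} + \mu_2 + \varepsilon_2\sigma_2$$
$$I_x' = [N_x]\, e^{\mu_1 + \varepsilon_1\sigma_1} + \mu_2 + \varepsilon_2\sigma_2$$
$$I_y = [q_y f'_y + N_y]\, e^{\mu_1 + \varepsilon_1\sigma_1} + \mu_2 + \varepsilon_2\sigma_2$$
$$I_y' = [N_y]\, e^{\mu_1 + \varepsilon_1\sigma_1} + \mu_2 + \varepsilon_2\sigma_2$$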
Bioinformatics: Data modeling
• Good news: For each 25-bp probe, the
fluorescent signal increases linearly with the
amount of complementary DNA in the sample
(up to some limit where it saturates).
• Bad news: The linear scaling and offset differ
for each 25-bp probe. Scaling varies by
factors of more than 10x.
• Noise : Due to PCR & cross hybridization and
measurement noise.
Scaling & Offset differ
• Scaling varies across probes:
– Each 25-bp sequence has different thermodynamic properties.
• Scaling varies across samples:
– The scanning laser for different samples may have different
levels.
– The starting DNA concentrations may differ; PCR may amplify
differently.
• Offset varies across probes:
– Different levels of Cross Hybridization with the rest of the
Genome.
• Offset varies across samples:
– Different sample genomes may differ slightly (sample
degradation; impurities, etc.)
Linear Model + Noise
$i$ = sample
$k$ = probe in probeset $j$
$PM_{ik}$ = observed DNA level
$\theta_{ik}$ = true DNA level
$$PM_{ik} = K_i (N_k + \theta_{ik}\phi_k)\, e^{\varepsilon_1\sigma_{ik}} + C_i + \varepsilon_2 s_{ik}$$
where $\varepsilon_1, \varepsilon_2$ are Gaussian noise sources and $\sigma_{ik}, s_{ik}$ are noise scaling factors.
Noise minimization
Just estimate $\theta_{ik}$ and parameters given $PM_{ik}$ using Maximum Likelihood Estimate (MLE).
This is much simpler if we have only one noise term. We can approximate with a single multiplicative noise term:
$$PM_{ik} \approx K_i (N_k + \theta_{ik}\phi_k + F_i)\, e^{\varepsilon\sigma_{ik}} + C_i - K_i F_i$$
Final Data Model
$$A_i (PM_{ik} + B_i) = (N_k + \theta_{ik}\phi_k + F_i)\, e^{\varepsilon_{ik}\sigma_{ik}}$$
where $\sigma_{ik} = \sigma_i\tau_k$, and $\theta_{ik}$ is the same for all probes $k$ in the same probeset $j$.
The corresponding probability density is:
$$P(PM_{ik} \mid \text{parameters}) = \frac{e^{-\varepsilon_{ik}^2/2}}{(PM_{ik} + B_i)\sqrt{2\pi\sigma_{ik}^2}}$$
MLE using gradients
Overall log likelihood (no priors):
$$L = -\sum_{i,k}\left\{\log(PM_{ik} + B_i) + \log(\sigma_i\tau_k) + \frac{\left[\log\dfrac{A_i(PM_{ik} + B_i)}{N_k + \theta_{ik}\phi_k + F_i}\right]^2}{2\sigma_i^2\tau_k^2}\right\}$$
For each parameter $\theta$, gradient update:
$$\theta \leftarrow \theta - \frac{\partial L/\partial\theta}{\partial^2 L/\partial\theta^2}$$
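The Newton-style update can be illustrated on a single parameter. The sketch below assumes all other parameters are known and estimates only the copy level theta for one probeset under the single multiplicative noise model, with `y` standing in for A(PM + B). The function name `newton_theta` and the toy numbers are hypothetical, not taken from the slides.

```python
import math

def newton_theta(y, N, phi, F, sigma, theta0=1.0, iters=25):
    """Newton's method on the negative log-likelihood sum_k r_k^2 / (2 sigma^2),
    where r_k = log(y_k) - log(N_k + theta * phi_k + F)."""
    theta = theta0
    for _ in range(iters):
        grad = hess = 0.0
        for y_k, N_k, phi_k in zip(y, N, phi):
            m = N_k + theta * phi_k + F            # model value for this probe
            r = math.log(y_k / m)                  # log-residual
            u = phi_k / m
            grad += -r * u / sigma ** 2            # first derivative in theta
            hess += u * u * (1.0 + r) / sigma ** 2 # second derivative in theta
        if hess <= 0:                              # curvature not usable: stop
            break
        theta -= grad / hess
    return theta

N = [0.5, 0.3, 0.8]
phi = [1.0, 0.5, 2.0]
F = 0.1
true_theta = 2.0
y = [n + true_theta * p + F for n, p in zip(N, phi)]   # noise-free toy data
theta_hat = newton_theta(y, N, phi, F, sigma=0.2)
```

On noise-free data the iteration should return the generating value. With real data every parameter would be updated in turn, which is where the non-uniqueness discussed later matters.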
Data Outliers
• Our data model fails for a few data points (“bad probes”)
– Soln (1): Improve the model…
– Soln (2): Discard the outliers
– Soln (3): Alternate model for the outliers… Weight the data appropriately.
Outlier Model
$$P(PM_{ik}) = w_1 P_1(PM_{ik}) + (1 - w_1) P_2(PM_{ik})$$
where $P_2(PM_{ik})$ = uniform distribution and $w_1$ = prior probability that data is NOT an outlier.
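One way to use such a mixture for weighting is to compute, for each data point, the posterior probability that it came from the main model. The sketch below is an assumption-laden illustration: a plain Gaussian stands in for P1 (the slides use the log-normal data model), and `inlier_responsibility` is a hypothetical name.

```python
import math

def inlier_responsibility(x, mean, sd, w1, lo, hi):
    """Posterior probability that x came from P1 rather than the uniform P2."""
    p1 = math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    p2 = 1.0 / (hi - lo) if lo <= x <= hi else 0.0
    mix = w1 * p1 + (1.0 - w1) * p2      # P(x) = w1*P1(x) + (1 - w1)*P2(x)
    return w1 * p1 / mix if mix > 0 else 0.0

typical = inlier_responsibility(0.0, mean=0.0, sd=1.0, w1=0.95, lo=-10.0, hi=10.0)
extreme = inlier_responsibility(6.0, mean=0.0, sd=1.0, w1=0.95, lo=-10.0, hi=10.0)
```

A point near the model mean keeps a weight close to 1, while a point six standard deviations away is attributed almost entirely to the uniform component.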
Problem with MLE: No unique maximum
The following have no effect on the probability:
1. Increase all $F_i$ and decrease all $N_k$ by $C$.
2. In any probeset $j$: increase $\theta_{ik}$ by $N$ and decrease $N_k$ by $N\phi_k$.
3. Scale all $A_i$, $N_k$, $F_i$, $\theta_{ik}$ by the same factor $C$.
4. Scale $\sigma_i$ and unscale $\tau_k$ by the same factor $C$.
5. In any probeset $j$: scale $\phi_k$ and unscale $\theta_{ik}$ by the same factor $C$.
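These invariances can be checked numerically on the quantity the likelihood depends on, log[A(PM + B)] - log[N + theta*phi + F]. The sketch below uses made-up parameter values and a hypothetical helper `log_residual`; it covers items 1, 2, 3 and 5 for a single probe.

```python
import math

def log_residual(PM, A, B, N, theta, phi, F):
    """log[A (PM + B)] - log[N + theta*phi + F]."""
    return math.log(A * (PM + B)) - math.log(N + theta * phi + F)

base = dict(PM=500.0, A=0.01, B=20.0, N=0.4, theta=2.0, phi=1.5, F=0.3)
r0 = log_residual(**base)

C = 0.1
variants = [
    dict(base, F=base["F"] + C, N=base["N"] - C),                        # item 1
    dict(base, theta=base["theta"] + C, N=base["N"] - C * base["phi"]),  # item 2
    dict(base, A=3 * base["A"], N=3 * base["N"],
         F=3 * base["F"], theta=3 * base["theta"]),                      # item 3
    dict(base, phi=3 * base["phi"], theta=base["theta"] / 3),            # item 5
]
residuals = [log_residual(**v) for v in variants]
```

Each transformed parameter set yields the same residual as the base set, so the data alone cannot distinguish between them.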
Scaling of MLE estimate
The MLE estimate of $\theta_{ij}$ must be rescaled:
$$\hat{\theta}_{ij} = C_j \theta_{ij} + D_j$$
The correct scaling factors $C_j$, $D_j$ cannot be inferred from the data model.
However, we can use priors on the copy number $\theta_{ij}$ and the relative frequency of alleles A and B.
Segmentation to reduce noise
• The true copy number (Allele A+B) is
normally 2 and does not vary across the
genome, except at a few locations
(breakpoints).
• Segmentation can be used to estimate the
location of breakpoints and then we can
average all estimated copy number values
between each pair of breakpoints to reduce
noise.
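The averaging step can be sketched as below, assuming the breakpoints have already been found and are given as the indices where new segments start. `smooth_by_segments` is a hypothetical name.

```python
def smooth_by_segments(values, breakpoints):
    """Replace each value by the mean of its segment.

    `breakpoints` lists the indices at which a new segment starts."""
    bounds = [0] + sorted(breakpoints) + [len(values)]
    smoothed = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end > start:
            mean = sum(values[start:end]) / (end - start)
            smoothed.extend([mean] * (end - start))
    return smoothed

noisy = [2.1, 1.9, 2.0, 3.2, 2.8, 3.0]
smoothed = smooth_by_segments(noisy, breakpoints=[3])
```

Averaging over a segment of m probes reduces the noise standard deviation by roughly a factor of sqrt(m), provided the noise is independent across probes.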
Allelic Frequencies: Cancer & Normal
Allelic Frequencies: Cancer & Normal
Segmentation & Break-Point Detection
Algorithmic Approaches
• Local Approach
– Change-point Detection
• (QSum, KS-Test, Permutation Test)
• Global Approach
– HMM models
– Wavelet Decomposition
• Bayesian & Empirical Bayes Approach
– Generative Models
• (One- or Multi-level Hierarchical)
– Maximum A Posteriori
HMM
[Figure: HMM with one state per copy number, 0 through 6]
Model with a very high degree of freedom, but not enough data points.
Small-sample statistics → overfitting, convergence to local maxima, etc.
HMM, finally…
[Figure: HMM with three states: copy number ≥3, ≤1, and 2]
Model with a very high degree of freedom, but not enough data points.
Small-sample statistics → overfitting, convergence to local maxima, etc.
HMM, last time
We will simply model the number of break-points by a Poisson process, and lengths of the aberrational segments by an exponential process.
Two-parameter model: pb & pe
[Figure: two-state HMM (copy number 2 and copy number other than 2) with transition probabilities pb, 1-pb, pe, and 1-pe]
Advantages:
1. Small number of parameters. Can be optimized by MAP estimator. (EM has difficulties.)
2. Easy to model deviation from Markovian properties (e.g., polymorphisms, power-law, Polya’s urn-like process, local properties of chromosomes, etc.)
Generative Model
• Breakpoints: Poisson, pb
• Segmental length: Exponential, pe
• Copy number: empirical distribution
• Noise: Gaussian, μ, σ
[Figure labels: Amplification, c=4; Amplification, c=3; Deletion, c=1; Deletion, c=0]
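A toy simulation of such a generative model is sketched below. It rests on assumptions that the slides do not spell out: breakpoints are drawn as an independent Bernoulli(p_b) event per probe (a discrete stand-in for a Poisson process), a new segment is normal (copy number 2) with probability p_e, and aberrant copy numbers are drawn uniformly from a small set instead of an empirical distribution. `simulate_copy_number` is a hypothetical name.

```python
import random

def simulate_copy_number(n_probes, p_b, p_e, aberrant_copies=(0, 1, 3, 4),
                         noise_sd=0.2, seed=0):
    """Return (true copy numbers, noisy observations) for n_probes probes."""
    rng = random.Random(seed)
    truth, copy = [], 2
    for _ in range(n_probes):
        # Bernoulli(p_b) per probe approximates a Poisson breakpoint process.
        if rng.random() < p_b:
            # Assumption: a new segment is normal with probability p_e.
            copy = 2 if rng.random() < p_e else rng.choice(aberrant_copies)
        truth.append(copy)
    # Additive Gaussian measurement noise on every probe.
    observed = [c + rng.gauss(0.0, noise_sd) for c in truth]
    return truth, observed

truth, observed = simulate_copy_number(1000, p_b=0.01, p_e=0.35)
```

Because segments end at the next breakpoint, their lengths are geometric, the discrete analogue of the exponential length distribution named on the slide.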
A reasonable choice of priors yields good segmentation.
[Figure: copy number vs. probe number, chr = 8; sampling = 5, p_e = 0.35, p_b = 0.01]
A reasonable choice of priors yields good segmentation.
[Figure: copy number vs. probe number, chr = 2; sampling = 5, p_e = 0.55, p_b = 0.0001]
MAP (Maximum A Posteriori) Estimators
• Priors:
– Deletion + Amplification
• Data:
– Priors + Noise
• Goal: Find the most plausible hypothesis of regional changes and their associated copy numbers.
• Generalizes HMM: the prior depends on two parameters, pe and pb.
– pe is the probability of a particular probe being “normal”.
– pb is the average number of intervals per unit length.
[Figure: contour plot of the maximum likelihood over (pe, pb); max at (pe, pb) = (0.55, 0.01)]
Likelihood Function
• The likelihood function for the first $n$ probes:
$$L(\langle i_1, \mu_1, \dots, i_k, \mu_k \rangle) = e^{-p_b n} (p_b n)^k \times (2\pi\sigma^2)^{-n/2} \prod_{i=1}^{n} \exp\left[-\frac{(v_i - \mu_j)^2}{2\sigma^2}\right] \times p_e^{\#\text{global}} (1 - p_e)^{\#\text{local}}$$
– where $i_k = n$ and $i$ belongs to the $j$th interval.
– The Maximum A Posteriori algorithm (implemented as a dynamic programming solution) optimizes $L$ to get the best segmentation
$$L(\langle i^*_1, \mu^*_1, \dots, i^*_k, \mu^*_k \rangle)$$
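A simplified dynamic program in the same spirit is sketched below. It is not the algorithm from the slides: all prior terms (the Poisson factor in p_b and the p_e factors) are lumped into one constant `penalty` per segment, each segment mean is set to its sample mean, and the search is the plain quadratic-time recursion. `segment` is a hypothetical name.

```python
def segment(values, sigma, penalty):
    """Penalized least-squares segmentation by dynamic programming.

    Returns segment boundaries [0, b1, ..., n]."""
    n = len(values)
    s1 = [0.0] * (n + 1)          # prefix sums of v
    s2 = [0.0] * (n + 1)          # prefix sums of v^2
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def seg_cost(i, j):
        # Gaussian negative log-likelihood of values[i:j] around their mean.
        total = s1[j] - s1[i]
        sse = (s2[j] - s2[i]) - total * total / (j - i)
        return sse / (2.0 * sigma * sigma)

    best = [0.0] + [float("inf")] * n   # best[j]: optimal cost of values[:j]
    prev = [0] * (n + 1)                # prev[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + seg_cost(i, j) + penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i
    bounds, j = [n], n
    while j > 0:                        # backtrack through the last segments
        j = prev[j]
        bounds.append(j)
    return bounds[::-1]

signal = [2.0] * 10 + [4.0] * 5 + [2.0] * 10
boundaries = segment(signal, sigma=0.2, penalty=1.0)
```

Larger penalties favor fewer breakpoints, which plays a role loosely analogous to choosing a smaller p_b in the prior.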
Dynamic Programming Algorithm
• Generalizes Viterbi and extends it.
• Uses the optimal parameters for the generative model.
• Adds a new interval to the end:
$$\langle i_1, \mu_1, \dots, i_k, \mu_k \rangle \circ \langle i_{k+1}, \mu_{k+1} \rangle = \langle i_1, \mu_1, \dots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle$$
• Incremental computation of the likelihood function:
$$-\log L(\langle i_1, \mu_1, \dots, i_k, \mu_k, i_{k+1}, \mu_{k+1} \rangle) = -\log L(\langle i_1, \mu_1, \dots, i_k, \mu_k \rangle) + \frac{\text{new-res.}}{2\sigma^2} - \log(p_b n) + (i_{k+1} - i_k) \log\sqrt{2\pi\sigma^2} - (i_{k+1} - i_k)\left[I_{\text{global}} \log p_e + I_{\text{local}} \log(1 - p_e)\right]$$
Prior Selection: F criterion
[Figure: contour plot of Pf over (pe, pb), with pe on the horizontal axis and pb on the vertical axis; max at (pe, pb) = (0.55, 0.01)]
• For each break we have a $T^2$ statistic and the appropriate tail probability (p-value) calculated from the distribution of the statistic. In this case, this is an F distribution.
• The best (pe, pb) is the one that leads to the maximum min p-value.
$$T^2 = \frac{N_1 N_2}{N_1 + N_2} \cdot \frac{(\bar{x} - \bar{y})^2}{\dfrac{1}{df_1 + df_2}\left[\sum_i (x_i - \bar{x})^2 + \sum_j (y_j - \bar{y})^2\right]}$$
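The statistic for one breakpoint can be computed as in the sketch below, where `x` and `y` are the probe values on either side of the break. The degrees of freedom are assumed to be N1 - 1 and N2 - 1, which the slides do not state, and `t2_statistic` is a hypothetical name. The p-value would then be the upper tail of an F distribution with (1, df1 + df2) degrees of freedom, which needs a statistics library and is left out here.

```python
def t2_statistic(x, y):
    """Two-sample T^2 statistic for adjacent segments x and y."""
    n1, n2 = len(x), len(y)
    mean_x, mean_y = sum(x) / n1, sum(y) / n2
    # Pooled within-segment sum of squares.
    ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    df = (n1 - 1) + (n2 - 1)          # assumption: df1 = N1 - 1, df2 = N2 - 1
    pooled_var = ss / df
    return (n1 * n2 / (n1 + n2)) * (mean_x - mean_y) ** 2 / pooled_var

t2 = t2_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```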
Segmentation Analysis
Comparative Analysis: BAC Array
Comparative Analysis: Nimblegen
Comparative Analysis: Affy 10K
Simulated Data
• Array CGH simulations and an “ROC analysis”
– Using the same scheme as Lai et al.
• Weil R. Lai, Mark D. Johnson, Raju Kucherlapati, and
Peter J. Park (2005), “Comparative analysis of algorithms
for identifying amplifications and deletions in array CGH
data,” Bioinformatics, 21(19): 3763-3770.
• Segmented by Vmap and DNAcopy
• Vmap algorithm was tested at 11 segmentation P-values of: 0.1, $5 \times 10^{-2}$, $10^{-2}$, $10^{-3}$, $10^{-4}$, …, $10^{-10}$.
• DNAcopy algorithm was tested at 9 segmentation alpha values of: 0.9, 0.5, 0.1, $10^{-2}$, $10^{-3}$, $10^{-4}$, …, $10^{-7}$.
• Analysis by Alex Pearlman et al. (2006)
VMAP
DNACopy
Prostate Tumor Gains and Losses
Genome view of 19K BAC CGH
[Figure: log ratio vs. index (0 to 15000) for Tumor 1, Tumor 2, and Tumor 3]
Segmentation of Multi-BAC Events On
Chromosome 13
Proximal breakpoints were identical for T1 and T3.
Distal breakpoints overlapped for T1, T2, and T3.
[Figure tracks: Normal 1, 2, 3; Tumor 1; Tumor 2; Tumor 3]
Further Improvement
• We employed a hierarchical Bayesian
model in which global false discovery
rates can be calculated using the
different levels of the model.
• Noise processes are also estimated
using the appropriate global
parameters.
Specific Features of the Model
• We build a model in which, given the region segmentation, we assume that the copy numbers in region $I_j$ (= region $j$, $1 \le j \le k$) are mutually independent Gaussian random variables $X_{i,j} \sim N(\theta_j, \sigma_j^2)$ ($1 \le i \le n_j$) with mean $\theta_j$ and variance $\sigma_j^2$.
• We further assume that each region mean parameter $\theta_j$ is in one of a small number of ‘states’ $s \in \{1, \dots, S\}$, with respective probabilities $\pi_1, \dots, \pi_S$ of being in state $s$. $\theta_j$ is in state $s$ (with probability $\pi_s$) if it has a Gaussian distribution with state mean $\theta_s$ and state variance $\tau_s^2$.
• States serve to characterize regions. The state means and variances are the hyperparameters of the model.
Implementation:
Dynamic Programming
• Given the hyperparameters, we segment regions
using a dynamic programming approach. This
consists in constructing probe regions as follows:
– After the (j-1)st region has been constructed:
• A) we choose the next two contiguous regions to the
right of those already constructed by optimizing the
corresponding log likelihood, subject to the condition
that the p-value of the t-statistic distinguishing between
these two (aforementioned) regions is above a given
threshold.
• B) Having chosen these (aforementioned) regions, the
probe regions already constructed, contiguous to them,
may also need to be altered.
Segmentation (ROMA, chr3)
S*M*A*S*H
Single Molecule Approaches to Sequencing by Hybridization
~Extensions to Optical Mapping~
S*M*A*S*H
Fig 1
• Genomic DNA is carefully extracted from a small number of cells of an organism (e.g., human) in normal or diseased states. (Fig 1 shows a cancer cell to be studied for its oncogenomic characterization.)
S*M*A*S*H
Fig 2
DNA samples are prepared for
analysis with LNA probes and
restriction enzymes.
• LNA probes of length 6 –
8 nucleotides are
hybridized to dsDNA
(double-stranded genomic
DNA) in a test tube (Fig 2)
and the modified DNA is
stretched on a 1” x 1” chip
that has microfluidic
channels manufactured on
its surface. These surfaces
have been chemically
treated to create a positive
charge.
S*M*A*S*H
Fig 3
• Since DNA is slightly
negatively charged, it
adheres to the surface as it
flows along these channels
and stretches out. Individual
molecules range in size from
0.3 – 3 million base pairs in
length.
• Next, bright emitters are
attached to the probes on
the surface and the
molecules are imaged (Fig
3).
S*M*A*S*H
Fig 4
• A restriction enzyme [1] is added to break the DNA at specific sites. Since DNA molecules are under slight tension, the cut fragments of DNA relax like entropic springs, leaving small visible gaps corresponding to the positions of the restriction sites (Fig 4).
[1] A restriction enzyme is a highly specific molecular scissor that recognizes short nucleotide sequences and cuts the DNA at only those recognition sites.
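An in silico version of this digestion step can be sketched as follows. It assumes a single known strand and uses the XhoI recognition site CTCGAG (cut after the first C) as the default; `restriction_fragments` is a hypothetical helper, and optical sizing error is not modeled.

```python
def restriction_fragments(sequence, site="CTCGAG", cut_offset=1):
    """Fragment lengths after cutting `sequence` at every occurrence of `site`.

    cut_offset is the position of the cut inside the site (1 = after the
    first base)."""
    cuts, start = [], 0
    while True:
        pos = sequence.find(site, start)
        if pos == -1:
            break
        cuts.append(pos + cut_offset)
        start = pos + 1
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds[:-1], bounds[1:])]

toy_molecule = "AAAA" + "CTCGAG" + "TTTTT" + "CTCGAG" + "GG"
fragments = restriction_fragments(toy_molecule)
```

The ordered list of fragment lengths is the restriction map against which probe locations are later placed.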
S*M*A*S*H
Fig 5
• The DNA is then stained
with a fluorogen (Fig 5)
and reimaged. The two
images are combined to
create a composite
image suggesting the
locations of a specific
short word (e.g.,
probes) within the
context of a pattern of
restriction sites.
S*M*A*S*H
Fig 6
– The intensity of the light
emitted by the dye at one
frequency provides a measure
of the length of the DNA
fragments.
– The intensity of the light
emitted by the bright-emitters
on probes provides an intensity
profile for locations of the
probes.
• Images of each DNA
molecule are then
converted into
ideograms, where the
restriction sites are
represented by a tall
rectangle and probe
S*M*A*S*H
Fig 7
ATAT
TATC
ATCA
TCAT
CATA
ATATCATAT
• The steps above are
repeated for all possible
probe compositions
(modulo reverse
complementarity).
• Sutta software then
uses the data from all
such individual
ideograms to create an
assembly of the
haplotypic ordered
restriction maps with
approximate probe
locations superimposed
S*M*A*S*H
Fig 7
ATAT
TATC
ATCA
TCAT
CATA
ATATCATAT
• Local clusters of
overlapping words are
combined by Sutta’s
PSBH (positional
sequencing by
hybridization) algorithm
to overlay the inferred
haplotypic sequence on
top of the restriction
map (Fig 7).
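The word-overlap idea behind the Fig 7 example can be illustrated with a toy merge. This is not Sutta's PSBH algorithm: it assumes the words are already ordered by their mapped positions, are error-free, and each overlaps its predecessor by k-1 bases. `merge_ordered_words` is a hypothetical name.

```python
def merge_ordered_words(words):
    """Chain position-ordered k-mers, each overlapping the previous by k-1 bases."""
    sequence = words[0]
    for word in words[1:]:
        if sequence[-(len(word) - 1):] != word[:-1]:
            raise ValueError(f"{word!r} does not overlap the sequence built so far")
        sequence += word[-1]          # each new word contributes one new base
    return sequence

# The words from Fig 7, in positional order (ATAT occurs at both ends).
words = ["ATAT", "TATC", "ATCA", "TCAT", "CATA", "ATAT"]
assembled = merge_ordered_words(words)
```

Positional information is what resolves the repeated word ATAT here. Without it, the five distinct words form a cycle and the sequence length is ambiguous.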
Gapped Probes
• Mixing ‘solid’ bases with ‘wild-card’ bases:
– E.g., xx*x**x*xx (10-4-mers) or xx*x****x*xx (12-6-mers)
• A ‘wild-card’ base
– Universal: in terms of its ability to form base pairs with the other natural DNA/RNA bases.
– Applications in primers and in probes for hybridization
• Examples:
– The naturally occurring base hypoxanthine, as its ribo- or 2'-deoxyribonucleoside
– 2'-deoxyisoinosine
– 7-deaza-2'-deoxyinosine
– 2-aza-2'-deoxyinosine
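A gapped probe can be modeled as a template in which wild-card positions match any base. The sketch below is an idealized, single-strand illustration with exact matching at solid positions. `probe_hits` and the example sequences are hypothetical.

```python
def probe_hits(sequence, solid_bases, pattern):
    """Positions where a gapped probe matches: 'x' = solid base, '*' = wild-card."""
    if pattern.count("x") != len(solid_bases):
        raise ValueError("need exactly one base per solid position")
    # Expand the pattern into a template such as 'AC*G**T*CA'.
    bases = iter(solid_bases)
    template = "".join(next(bases) if c == "x" else "*" for c in pattern)
    hits = []
    for start in range(len(sequence) - len(template) + 1):
        window = sequence[start:start + len(template)]
        if all(t == "*" or t == w for t, w in zip(template, window)):
            hits.append(start)
    return hits

hits = probe_hits("TTACAGCCTGCATT", solid_bases="ACGTCA", pattern="xx*x**x*xx")
```

The 10-4-mer pattern spans 10 positions but constrains only 6 of them, so it covers a longer stretch than an ungapped 6-mer while using the same number of solid bases.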
Simulation Results
• Probe Map Assumptions:
– For single DNA molecules:
• Probe location standard deviation = 240 bases;
• Data coverage per probe map = 50x;
• Probe hybridization rate = 30%; and
• False positive rate of 10 probes per megabase, uniformly distributed.
– Analytical estimation of the average error rate in the probe consensus map:
• Probe location SD = 60 bases;
• False Positive rate < 2.4%;
• False Negative rate < 2.0%.
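One plausible reading of the consensus location error, offered here as an assumption and not as the authors' derivation, is simple averaging: with 50x coverage and a 30% hybridization rate, about 15 molecules carry a probe at any given site, and averaging 15 independent observations shrinks the standard deviation by sqrt(15).

```python
import math

single_molecule_sd = 240.0   # bases, from the slide
coverage = 50                # maps covering each location, from the slide
hybridization_rate = 0.30    # fraction of sites that actually carry a probe

# Expected number of molecules with a probe observed at a given site.
effective_observations = coverage * hybridization_rate
# Standard error of the mean of that many independent observations.
consensus_sd = single_molecule_sd / math.sqrt(effective_observations)
```

This gives roughly 62 bases, in line with the quoted 60-base consensus SD. The false positive and false negative bounds are not reproduced by this calculation.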
Simulation Results
[Figure: errors per 10 kb of sequence (log scale) vs. bases per probe for ungapped probes, and vs. gapped bases per probe for gapped probes]
Simulation Results
• Simulation based on non-random sequences from
the human genome: 96 blocks of 1 Kb (from
chromosome 1) concatenated together along with its
in silico restriction map.
– Error summary for the gapped probe pattern
xx*x **** x*xx:
• Error count excluding repeats or near repeats:
0.32bp / 10Kb
– There is no error due to incorrect rearrangements.
– There is no loss of information at haplotypic level.
– Assembly failed in 2 of 96 blocks of 1kb = 2.1% failure rate
(out of memory).
GENomic conTIG
• Gentig uses a purely Bayesian Approach.
– It models all the error processes in the prior.
– FAST: It initially starts with a conservative but
fast pairwise overlap configuration, computed
efficiently using Geometric Hashing.
– ACCURATE: It iteratively combines pairs of maps or map contigs, while optimizing the likelihood score subject to a false-positive constraint.
– It has special heuristics to handle non-local errors.
HAPTIG: HAPlotypic conTIG
Candida albicans
FAST & ACCURATE BAYESIAN ALGORITHM
• The left end of chromosome 1 of the common fungus Candida albicans (being sequenced by Stanford).
• You can clearly see 3 polymorphisms:
– (A) Fragment 2 is of size 41.19kb (top) vs 38.73kb (bottom).
– (B) The 3rd fragment of size 7.76kb is missing from the top haplotype.
– (C) The large fragment in the middle is of size 61.78kb vs 59.66kb.
Lambda DNA with probes
[Figure A; scale bars: 10 μm and 500 nm]
Fig. A: Four AFM images of lambda DNA with PNA probes hybridized to the distal recognition site, located 6,900 bp or 2.28 microns from the end (green arrow). Non-specifically bound probes are indicated by the red arrows. Z-scale is +/- 1.5 nm.
E. coli
Figure 3. Two optical images of E. coli K12 genomic DNA after restriction digestion with the 6-cutter restriction enzyme XhoI and hybridization with an 8-mer PNA probe. Bound probes are indicated by blue arrows and non-specifically bound probes by the red arrows. The scale bar shown is 10 microns.
Discussions
Q&A…
Answer to Cancer
• “If I know the answer I'll tell you
the answer, and if I don't, I'll just
respond, cleverly.”
– US Secretary of Defense, Mr. Donald
Rumsfeld.
To be continued…
Break…