Download (1) in ppt - NYU Computer Science Department

Computational Systems Biology of Cancer:
(I) Human Cancer Genome Project
Bud Mishra
Professor of Computer Science,
Mathematics and Cell Biology
Courant Institute, NYU School of Medicine, Tata Institute of
Fundamental Research, and Mt. Sinai School of Medicine
War on Cancer
• “Reports that say that something hasn't
happened are always interesting to me,
because as we know, there are known
knowns; there are things we know we
know. We also know there are known
unknowns; that is to say we know there
are some things we do not know. But
there are also unknown unknowns –
the ones we don't know we don't know.”
– US Secretary of Defense, Mr. Donald
Rumsfeld, Quoted completely out of context.
Introduction: Cancer and Genomics:
What we know & what we do not
“Cancer is a disease of the
genome.”
Outline
• Genomics
• Genome Modification & Repair
• Segmental Duplications Models
Genomics
• Genome:
– Hereditary information of an organism is
encoded in its DNA and enclosed in a cell
(unless it is a virus). All the information
contained in the DNA of a single organism
is its genome.
• DNA molecule can be thought of as a very
long sequence of nucleotides or bases:
S = {A, T, C, G}
Complementarity
• DNA is a double-stranded polymer and should be thought of as a pair of sequences over S.
• However, there is a relation of complementarity between the two sequences:
– A ↔ T, C ↔ G
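The complementarity relation can be sketched in a few lines of Python. The function name `reverse_complement` is an illustrative choice, not something from the slides. The sketch assumes an uppercase sequence over {A, T, C, G} and relies on the standard fact that the two strands run antiparallel, so the partner strand is read in reverse.

```python
# Watson-Crick pairing: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq: str) -> str:
    """Return the partner strand, read 5'->3' (hence the reversal)."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original sequence, which is one way to see that either strand determines the other.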
DNA Structure.
• The four nitrogenous bases of
DNA are arranged along the
sugar-phosphate backbone in a
particular order (the DNA
sequence), encoding all genetic
instructions for an organism.
• Adenine (A) pairs with thymine
(T), while cytosine (C) pairs with
guanine (G).
• The two DNA strands are held
together by weak bonds between
the bases.
Structure and Components
• Complementary base pairs
(A-T and C-G)
• Cytosine and thymine are smaller (lighter)
molecules, called pyrimidines
• Guanine and adenine are bigger (bulkier)
molecules, called purines.
• Adenine and thymine allow only for double
hydrogen bonding, while cytosine and guanine
allow for triple hydrogen bonding.
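As a small worked example of the pairing rules above, the following sketch counts the hydrogen bonds holding a duplex together, given one strand. The names `H_BONDS`, `hydrogen_bonds` and `gc_content` are hypothetical helpers introduced for illustration; the only facts used are the two bonds per A-T pair and three per C-G pair stated on the slide.

```python
# Hydrogen bonds contributed by the base pair that each base forms.
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}

def hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds between a strand and its complement."""
    return sum(H_BONDS[base] for base in strand)

def gc_content(strand: str) -> float:
    """Fraction of positions that form the triple-bonded C-G pair."""
    return (strand.count("G") + strand.count("C")) / len(strand)

print(hydrogen_bonds("ATGC"))  # 10
print(gc_content("ATGC"))      # 0.5
```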
Inert & Rigid
• Thus the chemical (hydrogen bonding) and the mechanical (purine to pyrimidine) constraints on the pairing lead to the complementarity and make the double-stranded DNA both chemically inert and mechanically quite rigid and stable.
The Central Dogma
DNA → (transcription) → RNA → (translation) → Protein
• The central dogma (due to Francis Crick in 1958) states that these information flows are all unidirectional:
“The central dogma states that once ‘information’ has passed into protein it cannot get out again.”
The Central Dogma
“…The transfer of information from nucleic
acid to nucleic acid, or from nucleic acid to
protein, may be possible, but transfer from
protein to protein, or from protein to
nucleic acid is impossible. Information
means here the precise determination of
sequence, either of bases in the nucleic
acid or of amino acid residues in the
protein.”
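The two information flows named on these slides, transcription and translation, can be illustrated with a minimal Python sketch. The function names `transcribe` and `translate` are hypothetical, and the sketch assumes the standard genetic code, a coding-strand input, and translation that starts at position 0 and stops at the first stop codon. Real transcription and translation involve far more machinery than this string manipulation suggests.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(coding_strand: str) -> str:
    """DNA -> RNA: the mRNA matches the coding strand with U in place of T."""
    return coding_strand.replace("T", "U")

def translate(mrna: str) -> str:
    """RNA -> protein: read codons from position 0 until a stop codon ('*')."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate(transcribe("ATGGCCTAA")))  # MA
```

Note that there is no function going from protein back to nucleic acid: several codons map to the same amino acid, so the mapping is not invertible as written.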
Genome Evolution: The New Synthesis
DNA → (transcription) → RNA → (translation) → Protein
Part-lists, Annotation, Ontologies
Selection
Cancer Initiation and Progression
• Mutations, Translocations, Amplifications, Deletions
• Epigenomics (Hyper- & Hypomethylation)
• Alternate Splicing
• Proliferation, Motility, Immortality, Metastasis, Signaling
Multi-step Nature of Cancer:
• Cancer is a stepwise process, typically
requiring accumulation of mutations in
a number of genes.
• ~6-7 independent mutations typically
occur over several decades:
– Conversion of proto-oncogenes to
oncogenes
– Inactivation of tumor suppressor genes
Amplifications & Deletions
Mutation in a TSG
Epigenomics
Conversion of a Proto-Oncogene
Deletion of a TSG
Deletion of a TSG
P53 Gene (TSG)
The Cancer Genome Atlas
• Obtain a comprehensive
description of the genetic basis of
human cancer.
– Identify and characterize all the sites of
genomic alteration associated at significant
frequency with all major types of cancers.
The Cancer Genome Atlas
• Increase the effectiveness of research
to understand
– tumor initiation and progression,
– susceptibility to carcinogenesis,
– development of cancer therapeutics,
– approaches for early detection of tumors &
– the design of clinical trials.
Specific Goals
• Identify all genomic alterations
significantly associated with all major
cancer types.
• Such knowledge will propel work by
thousands of investigators in cancer
biology, epidemiology, diagnostics and
therapeutics.
To Achieve this goal …
• Create large
collection of
appropriate,
clinically
annotated
samples from all
major types of
cancer; and
• Characterize each sample in
terms of:
– All regions of genomic loss or
amplification,
– All mutations in the coding
regions of all human genes,
– All chromosomal
rearrangements,
– All regions of aberrant
methylation, and
– Complete gene expression
profile, as well as other
appropriate technologies.
Biomedical Rationale
• Cancer is a heterogeneous collection of
heterogeneous diseases.
– For example, prostate cancer can be an
indolent disease remaining dormant
throughout life or an aggressive disease
leading to death.
– However, we have no clear understanding
of why such tumors differ.
Biomedical Rationale
• Cancer is fundamentally a disease of genomic
alteration.
– Cancer cells typically carry many genomic
alterations that confer on tumors their distinctive
abilities (such as the capacity to proliferate and
metastasize, ignoring the normal signals that block
cellular growth and migration) and liabilities (such
as unique dependence on certain cellular
pathways, which potentially render them sensitive
to certain treatments that spare normal cells).
History
• 1960s
– The genetic basis of cancer was clear from cytogenetic studies that
showed consistent translocations associated with specific cancers
(notably the so-called Philadelphia chromosome in chronic
myelogenous leukemia).
• 1970s
– Specific cancer-causing mutations were recognized through the recombinant DNA revolution.
– The first vertebrate and human oncogenes and the first tumor suppressor genes were identified.
– These discoveries have elucidated the cellular pathways governing processes such as cell-cycle progression, cell-death control, signal transduction, cell migration, protein translation, protein degradation and transcription.
• For no human cancer do we have a comprehensive
understanding of the events required.
Scientific Foundation for a Human Cancer
Genome Project
• Gene resequencing.
– Specific gene classes (such as kinases and phosphatases) in particular cancer types.
• Epigenetic changes.
– Loss of function of tumor suppressor genes by epigenetic modification of the genome, such as DNA methylation and histone modification.
• Genomic loss and amplification.
– Consistent association with genomic loss or amplification in many specific regions, indicating that these regions harbor key cancer-associated genes.
• Chromosome rearrangements.
– Activation of kinase pathways through fusion proteins, or inactivation of differentiation programs through gene disruption.
– Hematological malignancies: a single stereotypical translocation in some diseases (such as CML) and as many as 20 important translocations in others (such as AML).
– Adult solid tumors have not been as well characterized, in part owing to technical hurdles.
Human Genome Structure
EBD
• J.B.S. Haldane (1932):
– “A redundant duplicate of
a gene may acquire
divergent mutations and
eventually emerge as a
new gene.”
• Susumu Ohno (1970):
– “Natural selection merely
modified, while
redundancy created.”
Evolution by Duplication
(Figure: levels of organization, from organism to living cells, functional modules, and motifs in cellular networks: regulatory motifs (Leu3, LEU1, BAT1, ILV2), metabolic pathways (UMP, UDP, UTP, CTP) and interaction clusters.)
• Non-random distribution in genomic data: long-range correlation, short-word frequency, protein family size
• Genome evolution?
Human Condition
Mer-scape…
• Overlapping words of different sizes are analyzed for their frequencies along the whole human genome.
• Red: 24-mers
• Green: 21-mers
• Blue: 18-mers
• Gray: 15-mers
• To the very left is a ubiquitous human transposon, Alu. The high frequency is indicative of its repetitive nature.
• To the very right is the beginning of a gene. The low frequency is indicative of its uniqueness in the whole genome.
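The idea behind the mer-scape can be sketched as follows: count every overlapping k-mer once, then report, for each position, how often the word starting there occurs in the whole sequence. The function names `kmer_counts` and `mer_scape` are illustrative. The sketch counts on a single strand and holds all counts in memory, which is an assumption that would not scale to the human genome without a more compact index.

```python
from collections import Counter

def kmer_counts(genome: str, k: int) -> Counter:
    """Count all overlapping words of length k."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

def mer_scape(genome: str, k: int) -> list[int]:
    """For each start position, the genome-wide frequency of its k-mer."""
    counts = kmer_counts(genome, k)
    return [counts[genome[i:i + k]] for i in range(len(genome) - k + 1)]

# Repetitive regions show high values, unique regions show 1.
print(mer_scape("ACGTACGTTT", 4))  # [2, 1, 1, 1, 2, 1, 1]
```

Running the same function for several word sizes (for example 15, 18, 21 and 24) gives the overlaid profiles described on the slide.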
Doublet Repeats
• Serendipitous discovery of a new uncataloged class of short duplicate sequences: doublet repeats.
– almost always < 100 bp
• (Top) The distance between the two loci of a doublet is plotted versus the chromosomal position of the first locus.
• (Bottom) Distribution of doublets (black) and segmental duplications (red) across human chromosome 2. Axes: start of 10 Mb window on chromosome 2 (0 to 230,000,000) versus number of elements in 10 Mb window (0 to 800).
Segmental Duplications
• 3.5% to 5% of the human genome is found to contain segmental duplications, with length > 5 kb or > 1 kb and identity > 90%.
• These duplications are estimated to have emerged about 40 Mya under the assumption of neutral evolution.
• The duplications are mostly interspersed (non-tandem), and occur both inter- and intra-chromosomally.
From [Bailey, et al. 2002]
The Model
(Figure: flanking-region states f--, f+- and f++ of a duplicated pair. Labels: duplication by recombination between repeats; duplication by recombination between other repeats or other mechanisms; insertion; deletion or mutation; mutation accumulation in the duplicated sequences.)
The Mathematical Model
(Figure: state-transition diagram over time after duplication. Duplications enter from H0 (h0--, h0+-, h0++) or H1 (h1--, h1++) and move among the flanking states f--, f+- and f++ within the divergence bins 0 ≤ d < ε, ε ≤ d < 2ε, …, (k-1)ε ≤ d < kε. Transition labels: α, β/2, 2β, γ, 2γ; self-transitions: 1-α-2β, 1-α-β/2-γ, 1-α-2γ.)
h1: proportion of duplications by repeat recombination;
h1++: proportion of duplications by recombination of the specific repeat;
h1--: proportion of duplications by recombination of other repeats;
h0: proportion of duplications by other, repeat-unrelated mechanisms;
h0++: proportion of h0 with common specific repeat in the flanking regions;
h0+-: proportion of h0 with no common specific repeat in the flanking regions;
h0--: proportion of h0 with no specific repeat in the flanking regions;
α: mutation rate in duplicated sequences;
β: insertion rate of the specific repeat;
γ: mutation rate in the specific repeat;
d: divergence level of duplications;
ε: divergence interval of duplications.
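One possible reading of the diagram is a three-state Markov chain over the flanking states (f--, f+-, f++). The sketch below is an assumption-laden reconstruction, not the authors' code: the self-transition probabilities 1-α-2β, 1-α-β/2-γ and 1-α-2γ come from the slide, while the assignment of the remaining labels (2β from f-- to f+-, γ from f+- to f--, β/2 from f+- to f++, 2γ from f++ to f+-) is inferred so that each row sums to one. The α transition, which moves a duplication to the next divergence bin, is folded into the diagonal here, and the parameter values in the usage line are made up for illustration.

```python
def flank_transition_matrix(beta: float, gamma: float) -> list[list[float]]:
    """Per-step transitions among (f--, f+-, f++), ignoring the divergence bin.

    Assumed reading of the diagram: insertion of the specific repeat (beta)
    adds a flanking repeat, mutation of the repeat (gamma) removes one.
    """
    return [
        [1 - 2 * beta, 2 * beta, 0.0],              # from f--
        [gamma, 1 - beta / 2 - gamma, beta / 2],    # from f+-
        [0.0, 2 * gamma, 1 - 2 * gamma],            # from f++
    ]

def step(dist: list[float], matrix: list[list[float]]) -> list[float]:
    """Advance a distribution over (f--, f+-, f++) by one time step."""
    return [sum(dist[i] * matrix[i][j] for i in range(3)) for j in range(3)]

# Hypothetical rates, chosen only to show the mechanics.
m = flank_transition_matrix(beta=0.01, gamma=0.02)
print(step([1.0, 0.0, 0.0], m))
```

Iterating `step` from an initial mix of h0 and h1 starting states would give the expected proportions of f--, f+- and f++ at each divergence level, which is the quantity the model fitting compares with data.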
Model Fitting
(Figure: fitted f--, f+- and f++ curves against diversity, for Alu and L1.)
• The model parameters (α, β and γ for Alu; α, β and γ for L1) are estimated from the mutation and insertion rates reported in the literature.
• The relative strengths of the alternative hypotheses can be estimated by fitting the model to the real data.
h1 (Alu) ≈ 0.76; h1++ (Alu) ≈ 0.3; h1 (L1) ≈ 0.76; h1++ (L1) ≈ 0.35.
Polya’s Urn
Randomly drawn
Put back
Duplication
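The three steps on this slide (draw at random, put back, duplicate) are the classical Polya urn. A minimal simulation, with the function name `polya_urn` and the ball labels chosen purely for illustration:

```python
import random

def polya_urn(initial: list[str], draws: int, seed: int = 0) -> list[str]:
    """Simulate a Polya urn: each drawn ball is returned with one extra copy."""
    rng = random.Random(seed)
    urn = list(initial)
    for _ in range(draws):
        ball = rng.choice(urn)  # random draw; the ball itself stays in the urn
        urn.append(ball)        # duplication: add a copy of the drawn ball
    return urn

urn = polya_urn(["red", "blue"], draws=1000)
print(urn.count("red") / len(urn))
```

Because frequent items are more likely to be drawn and copied again, early random fluctuations are amplified, which is the "rich get richer" behavior that makes the urn a natural model for duplication-driven growth.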
Repetitive Random Eccentric GOD
• Genome Organizing Devices (GOD)
• Polya’s Urn Model:
• F’s: functions deciding probability distributions
– F1 (decide an initial position)
– F2 (decide selected length)
– F3 (decide copy number, k copies)
– F4 (decide insertion positions)
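The four functions can be sketched as one duplication step on a string. Everything specific below is an assumption made for illustration: the slides do not say which distributions F1 to F4 use, so the sketch takes uniform choices, and the function name `duplicate_segment` and the bounds `max_len` and `max_copies` are hypothetical.

```python
import random

def duplicate_segment(genome: str, rng: random.Random,
                      max_len: int = 5, max_copies: int = 3) -> str:
    """One duplication event on a non-empty genome, with uniform F1..F4."""
    start = rng.randrange(len(genome))                          # F1: initial position
    length = rng.randint(1, min(max_len, len(genome) - start))  # F2: selected length
    copies = rng.randint(1, max_copies)                         # F3: copy number k
    segment = genome[start:start + length]
    for _ in range(copies):
        pos = rng.randint(0, len(genome))                       # F4: insertion position
        genome = genome[:pos] + segment + genome[pos:]
    return genome

rng = random.Random(42)
genome = "ACGTACGT"
for _ in range(3):
    genome = duplicate_segment(genome, rng)
print(len(genome), genome)
```

Swapping the uniform choices for other distributions (for example, insertion positions biased toward the source locus) changes the statistics of the resulting sequence without changing the structure of the step.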
DNA polymerase stuttering
(replication slippage)
• Deletion and insertion (duplication) are each shown with the same steps (figure):
– normal replication by DNA polymerase
– polymerase pausing and dissociation
– 3’ realignment and polymerase reloading
Transposons
~ causes: deletions and duplications
• Transposon actions in genomic DNA:
• Duplication (figure labels): donor DNA with transposon; DNA intermediates; target DNA; transposase cuts in target DNA; transposon inserted; DNA is repaired, resulting in a duplication of the transposon and target site.
• Deletion (figure labels): transposon with IS elements; transposon looping out; transposon deleted.
DNA mismatch repair mechanism
~ prevents duplications and deletions
• Mismatch recognition on the daughter strand
• DNA α-loop formation by translocation through the proteins
• Degradation of the mismatched daughter strand in the α-loop
• Refilling the gap by DNA polymerase
• Corrected daughter strand
(Figure labels: MSH2, MSH6, MLH1, PMS2, Exo, DNA polymerase.)
Graph Model
A. Deletion: with probability p0
B. Duplication: with probability p1
C. Substitution: with probability q
Graph description
• G = {V, E}; G is a directed multi-graph;
• V = {V_i : all n-mers, i = 1 … 4^n};
• E = {(V_i, V_j) : V_i represents the n-mer immediately upstream of the n-mer represented by V_j in the genomic sequence};
• k_i = incoming (or outgoing) degree of node i (V_i) = copy number of the n-mer represented by V_i;
• During the graph evolution, at each iteration, one of the following happens: deletion (with probability p0), duplication (with probability p1) or substitution (with probability q), where p0 + p1 + q = 1.
• For an arbitrary node V_i, the probabilities of the above events are as follows:
$P(V_i \text{ is deleted}) = p_0 \, \frac{k_i}{\sum_{j=1}^{N} k_j}$;
$P(V_i \text{ is duplicated}) = p_1 \, \frac{k_i}{\sum_{j=1}^{N} k_j}$;
$P(V_i \text{ is substituted}) = q \, \frac{k_i}{\sum_{j=1}^{N} k_j}$;
$P(V_i \text{ substitutes}) = q \, \frac{1}{N-1}$.
Model Fitting
(Figure panels: real distribution from genome analysis; expected distribution from random sequences; model fitting results.)
Model Fitted Parameter
Organism        6-mer            7-mer            8-mer            9-mer
                q       p1/p0    q       p1/p0    q       p1/p0    q       p1/p0
M. genitalium   0.0176  1.1      0.0436  1.1      0.1587  1.5      0.3222  1.5
M. pneumoniae   0.0319  1.1      0.1151  1.5      0.2309  1.5      0.4363  1.5
P. abyssi       0.0269  1.1      0.0672  1.2      0.1778  1.4      0.3897  1.5
P. horikoshii   0.0234  1.1      0.0443  1.1      0.1390  1.3      0.3456  1.5
P. furiosus     0.0213  1.1      0.0384  1.1      0.1119  1.2      0.3114  1.5
H. pylori       0.0180  1.1      0.0320  1.1      0.0925  1.3      0.2262  1.5
H. influenzae   0.0202  1.1      0.0366  1.1      0.1364  1.5      0.2802  1.5
S. tokodaii     0.0180  1.1      0.0320  1.1      0.0925  1.3      0.2262  1.5
S. subtilis     0.0187  1.1      0.0326  1.1      0.1139  1.4      0.2585  1.5
E. coli K12     0.0207  1.1      0.0334  1.1      0.0698  1.1      0.2389  1.5
S. cerevisiae   0.0113  1.1      0.0176  1.1      0.0459  1.2      0.1311  1.4
C. elegans      --      --       0.0076  1.1      0.0115  1.1      0.0275  1.2
• The substitution rate, q, increases with the n-mer size.
• The ratio between the duplication and deletion rates, p1/p0, increases with the n-mer size.
• The substitution rate, q, tends to decrease as genome size grows. In particular, q is much smaller in eukaryotic genomes than in prokaryotic genomes.
J.B.S. Haldane
• “If I were compelled to give my own
appreciation of the evolutionary process…, I
should say this: In the first place it is very
beautiful. In that beauty, there is an element of
tragedy…In an evolutionary line rising from
simplicity to complexity, then often falling back
to an apparently primitive condition before its
end, we perceive an artistic unity …
• “To me at least the beauty of evolution is far
more striking than its purpose.”
• J.B.S. Haldane, The Causes of Evolution.
1932.
Human Cancer Genome
Cancer
Normal epithelial mucosa
Neoplastic polyp
A Challenge
• “At present, description of a recently
diagnosed tumor in terms of its underlying
genetic lesions remains a distant prospect.
Nonetheless, we look ahead 10 or 20 years to
the time when the diagnosis of all somatically
acquired lesions present in a tumor cell
genome will become a routine procedure.”
– Douglas Hanahan and Robert Weinberg
• Cell, Vol. 100, 57-70, 7 Jan 2000
Karyotyping
CGH: Comparative Genomic Hybridization.
• Equal amounts of biotin-labeled tumor DNA and digoxigenin-labeled normal reference DNA are hybridized to normal metaphase chromosomes.
• The tumor DNA is visualized with fluorescein and the normal DNA with rhodamine.
• The signal intensities of the different fluorochromes are quantitated along the single chromosomes.
• The over- and underrepresented DNA segments (amplifications and deletions) are quantified by computation of tumor/normal ratio images and average ratio profiles.
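The ratio computation in the last step can be illustrated with a minimal sketch. The intensity values, the function names `log2_ratios` and `call_segments`, and the ±0.3 thresholds are all hypothetical choices made for the example; they are not taken from the slides, and real CGH analysis also needs normalization and noise modeling.

```python
import math

def log2_ratios(tumor: list[float], normal: list[float]) -> list[float]:
    """Per-position log2(tumor / normal) from paired signal intensities."""
    return [math.log2(t / n) for t, n in zip(tumor, normal)]

def call_segments(ratios: list[float], gain: float = 0.3,
                  loss: float = -0.3) -> list[str]:
    """Label each position using fixed (assumed) log-ratio thresholds."""
    calls = []
    for r in ratios:
        if r >= gain:
            calls.append("amplification")
        elif r <= loss:
            calls.append("deletion")
        else:
            calls.append("normal")
    return calls

ratios = log2_ratios(tumor=[200.0, 100.0, 50.0], normal=[100.0, 100.0, 100.0])
print(ratios)                 # [1.0, 0.0, -1.0]
print(call_segments(ratios))  # ['amplification', 'normal', 'deletion']
```

Working on the log scale makes a doubling and a halving symmetric around zero, which is why ratio profiles are usually plotted that way.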
CGH: Comparative Genomic
Hybridization.
Microarray Analysis of Cancer
Genome
• Representations are reproducible samplings of DNA populations in which the resulting DNA has a new format and reduced complexity.
– We array probes derived from low complexity representations (LCRs) of the normal genome.
– We measure differences in gene copy number between samples ratiometrically.
– Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific signal and noise.
(Figure: normal DNA and tumor DNA; normal LCR and tumor LCR; label; hybridize.)
Copy Number Fluctuation
(Figure panels: A1, B1, C1; A2, B2, C2; A3, B3, C3.)
Measuring gene copy number differences
between complex genomes
(Figure: genomic DNA from patient samples; hybridize to a 500,000-feature microarray; statistical analysis.)
• Compare the genomes of diseased and normal samples
• Error Control:
– The use of representations augmenting microarrays
– Representations reproducibly sample the genome, thereby reducing its complexity. This increases the signal-to-noise ratio and improves sensitivity
– Statistical modeling of the sources of noise
– Bayesian analysis
Nattering Nabobs of Negativism
• Some scientists are concerned about the
cost and the possibility that a project of this
scale could take money away from smaller
ones…
• Craig Venter, who led a private project to
determine the human DNA blueprint in
competition with the human genome project, said
it would make more sense to look at specific
families of genes known to be involved in cancer.
• Lee Hood, president of the Institute for Systems Biology, has called the premise of the Cancer Genome Project “naïve,” suggesting that the signal-to-noise issues its researchers are likely to encounter will be “absolutely enormous.”
Challenges
• ArrayCGH data is noisy!
– Better computational biology algorithms
– Better statistical modeling… In some technologies, the noise is systematic (e.g., Affymetrix SNP chips)
– Better bio-technologies (GRIN: Genomics, Robotics, Informatics,
Nanotechnology)
• No efficient technology for epigenomics, translocations or
de novo mutations.
• Algorithms for multi-locus association studies
• Systems view of cancer by integrating data from multiple
sources:
– Genomic, Epigenomic, Transcriptomic, Metabolomic & Proteomic.
– Regulatory, Metabolic and Signaling pathways
Mishra’s Mystical 3M’s
• Measure, Mine, Model
• Rapid and accurate solutions
– Bioinformatic, statistical, systems, and computational approaches.
– Approaches that are scalable, agnostic to technologies, and widely applicable
• Promises, challenges and obstacles
Discussions
Q&A…
Answer to Cancer
• “If I know the answer I'll tell you
the answer, and if I don't, I'll just
respond, cleverly.”
– US Secretary of Defense, Mr. Donald
Rumsfeld.
To be continued…
Break…