From Genotype
to Phenotype
Tim Conrad
Head of AG Computational Proteomics
Institut für Mathematik & Informatik,
Freie Universität Berlin
Today
• Personalized Genomes
• SNP Analyses
• Effects on Genes
• "One Gene -> One Protein" hypothesis revisited
• Role of proteins
• Genes / proteins for disease detection
Different genes different traits
It's the genes' fault!
Personalized Genomics
What are they analyzing?
What determines the individual?
• Two unrelated organisms of the same species (higher mammals) share about 99.5% of their DNA sequence.
• The remaining ~0.5% determines the "individual".
• These differences can be caused by, e.g., SNPs.
• A SNP is defined as a genomic locus where two or more alternative bases occur with appreciable frequency (>1%).
• SNPs occur every several hundred bases (~10,000,000 in total).
• Most of these have no phenotypic effect: fewer than 1% of all human SNPs impact protein function, since most lie in non-coding regions.
• SNPs can serve as proxies (markers) for nearby variants.
• Whole-genome SNP analysis is possible.
Picture source: Wikipedia
Single Nucleotide Polymorphism (SNP)
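The SNP definition above (two or more alternative bases, each with a frequency above 1%) can be checked mechanically. The following is a minimal sketch, assuming diploid genotypes are given as two-letter strings; the function names and the example locus are hypothetical and not taken from the slides.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {base: frequency} for diploid genotypes such as "AG"."""
    counts = Counter(base for genotype in genotypes for base in genotype)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}

def is_snp(genotypes, threshold=0.01):
    """True if at least two alleles each occur with frequency > threshold."""
    freqs = allele_frequencies(genotypes)
    common = [base for base, f in freqs.items() if f > threshold]
    return len(common) >= 2

# Hypothetical locus genotyped in 100 individuals (200 alleles).
locus = ["AA"] * 90 + ["AG"] * 9 + ["GG"] * 1
freqs = allele_frequencies(locus)
print(freqs, is_snp(locus))
```

In this made-up sample the G allele accounts for 11 of 200 alleles, so the locus passes the 1% threshold.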
Next Generation Sequencing
• 23andMe uses Illumina's DNA Analysis BeadChips (HumanHap550-Quad+ platform).
• Reads more than 550,000 SNPs, together with a 23andMe custom-designed set that analyzes more than 30,000 additional SNPs.
Common Haplotypes
• Tag SNPs that contain most of the information about the patterns of human
genetic variation are estimated to be about 300,000 to 600,000, which is far
fewer than the 10 million common SNPs.
What does it tell us?
Genome-wide Association Studies
Idea:
• Scan (Tag-)SNPs across many individuals
to associate alleles with a particular disease
• Use a detected association to detect, treat
and prevent the disease
E.g. eye color, blood type, disease X, …
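One simple way to "associate alleles with a disease" at a single SNP is an allelic chi-square test on a 2x2 table of allele counts in cases versus controls. The sketch below assumes that test (the slides do not name a specific statistic); the counts are invented for illustration.

```python
import math

def allelic_chi_square(case_counts, control_counts):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Each argument is a pair (count of allele A, count of allele B).
    Returns (chi2, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented allele counts: 200 alleles from cases, 200 from controls.
chi2, p = allelic_chi_square((60, 140), (40, 160))
print(round(chi2, 3), round(p, 4))
```

In a genome-wide scan this test is repeated for hundreds of thousands of SNPs, so the significance threshold has to be corrected for multiple testing.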
Very simple possible proposition
“The SNP at APOE position 4075 affects lipid levels!!!”
[Figure: Apolipoprotein E (APOE) gene map. Position axis from 0 to 5.5; exons 1 to 4; marked SNP positions: 73, 308, 471, 545, 560, 624, 832, 1163, 1522, 1575, 1998, 2440, 2907, 3106, 3673, 3701*, 3937, 4036, 4075, 4951, 5229A, 5229B, 5361]
The One Gene-One Enzyme Hypothesis
• George Beadle and Edward Tatum (1941) showed a direct relationship between genes and enzymes in the haploid fungus Neurospora crassa.
• This led to their one gene-one enzyme hypothesis and a share of the 1958 Nobel Prize in Physiology or Medicine.
• They built their theory on the work of Archibald Garrod, who proposed a relationship between genes and protein production in 1902.
Garrod: Biochemical Genetics
…is commonly caused by a mutation on chromosome 12 in the phenylalanine hydroxylase gene.
Phenylalanine is an essential amino acid, but an excess is harmful, so it is normally converted to tyrosine. Excess phenylalanine affects the CNS, causing intellectual disability, slow growth and early death.
“Real-World Biochemistry”
Aspartame = a dipeptide: aspartylphenylalanine methyl ester
Aspartame is metabolized in the body to its components: aspartic acid, phenylalanine, and methanol. Like other amino acids, it provides 4 calories per gram. Since it is about 180 times as sweet as sugar, the amount of aspartame needed to achieve a given level of sweetness is less than 1% of the amount of sugar required. Thus 99.4% of the calories can be replaced. (NutraSweet)
Look at your diet soda cans and read the warning.
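The 99.4% figure follows from the sweetness ratio. A short check, assuming sugar also provides about 4 calories per gram (that value is an assumption here, not stated on the slide):

```python
SWEETNESS_RATIO = 180          # aspartame vs. sugar, from the slide
KCAL_PER_GRAM_ASPARTAME = 4.0  # from the slide
KCAL_PER_GRAM_SUGAR = 4.0      # assumption: typical value for carbohydrates

def replaced_calorie_fraction(ratio, kcal_sweetener, kcal_sugar):
    """Fraction of sugar calories saved at equal sweetness."""
    mass_fraction = 1 / ratio  # grams of sweetener per gram of sugar replaced
    return 1 - mass_fraction * kcal_sweetener / kcal_sugar

saved = replaced_calorie_fraction(
    SWEETNESS_RATIO, KCAL_PER_GRAM_ASPARTAME, KCAL_PER_GRAM_SUGAR
)
print(f"mass needed: {100 / SWEETNESS_RATIO:.2f}% of the sugar mass")
print(f"calories replaced: {100 * saved:.1f}%")
```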
Contributing factors in causal models
Janssens 2008
Onset age and secular trends reveal context-specific gene expression
North Karelians in Finland had the world's highest heart disease rate. Over a period of 20 years, this was reduced by 75% through dietary change, yet there were no changes in the Finnish gene pool.
Finnish Institute of Public Health Newsletter Publication
Complex traits involve many loci with many alleles each
Each locus contributes a small amount that depends on its genomic context. Usually, a few loci have a few alleles that contribute more in a given population or sample.
These Quantitative Trait Loci (QTL) are what mapping finds.
Classical polygenes
(individual effects too small to identify, but substantial in aggregate)
Gene networks
http://www.jmdbase.jp/JmdBaseExt/img/AllGeneNetwork.png
A more realistic proposition
“Some combination(s) of SNPs at (or near) APOE
affect(s) some lipid levels in some way(s) at least under
some condition(s), at least to some extent.”
(Possibly !)
How would we know? Are propositions like these
Falsifiable? Verifiable? Replicable? Predictive?
Significant? Deductive? Probabilistic? Intuitive? Based
on preconceptions or scientific inertia?
From genes to proteins
Proteomics In Disease Treatment
• Nearly all major diseases (more than 98% of all hospital admissions) are caused by a particular pattern in a group of genes.
• Isolating this group by comparing the tens of thousands of genes in each of many genomes would be very impractical.
• Looking at the proteomes of the cells associated with the disease is much more efficient.
Why Proteomics?
In contrast to the static genome, the proteome is highly dynamic!
Central Dogma and Degradation
[Figure: DNA (genome) → mRNA (transcriptome) → protein (proteome) → metabolite (metabolome), with degradation and open questions ("?") along the way; read-outs: clinical response and drug response biomarker]
Central Dogma - REVISITED
Current Projects
• Biomarker identification
• Protein degradation
• BioPrint identification + QTL analysis
• Raw data analysis, feature selection, protein ID
(Names on the slide: Tim, Stefan, Pooja + Axel, Sharon, N/N, Sandro)
Introduction
Blood is a complex mixture of thousands of different molecules.
Some of these are proteins that occur only when we suffer from a disease.
• How can we identify these proteins and use them in diagnostics?
• Does a disease have a "fingerprint"?
• Can it help match patients to drugs?
Medical Analysis Pipeline
# Proteins Analysis
Pipeline
Molecule Seq.: AIILLQY,
Similarity to p53: 98%
Possible Protein
Fingerprint for
this disease
Protein Mass Proteomics Fingerprinting Pipeline
TC et al. (2006), Lecture Notes in Computer Science, 4216
(1) Very sensitive peak finding across
spectra from same group
Aim: Enable reliable peak assignment & property comparisons across spectra
Assign peaks to clusters (Masterpeaks)
(6d) Multi-dimensional Bayesian Clustering based on property similarity
(since clusters are not spherical)
Preprocessing
Superimposing
Clustering
ℜ^5000
Each spectrum: Point in ℜ^100,000
Point in ℜ^2000
Why bother with the small peaks?
• Detection and processing of many more (i.e. smaller) peaks in less time
• Small peaks are often hormones or other important low-abundance peptides
Better fingerprints!
w/o
Cell
~80% of all peaks
Benchmarks – Cell Port
Peak detection of a 1d spectrum (2d case: ~1500 1d spectra)

Platform        Cores   Time
Intel 3.2GHz    1       9 sec
Intel 3.2GHz    2       7 sec
Intel 3.2GHz    1       250ms (6min)
PS3             1       110ms
PS3             6       22ms
QS22            12      18ms

Slide annotations: "Bad" Java code; highly optimized C code; further parenthesized times (~3h) and (0.5min).
Joint work with Christopher Thöns, Marco Ziegert
Find the Features
100s
… …
(1000s)
100s
Find alignments (pairs)
of Masterpeaks by
Calculate Jensen-Shannon
divergence of distributions
Bipartite Graph Matching
for each alignment
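The Jensen-Shannon divergence named above is a symmetric, bounded measure of how different two distributions are. A minimal sketch for discrete distributions follows; how the slide's pipeline builds the distributions from Masterpeak properties is not stated, so the two histograms are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 (identical) to 1 (disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical normalized histograms of one peak property in two groups.
hist_a = [0.1, 0.4, 0.4, 0.1]
hist_b = [0.2, 0.3, 0.3, 0.2]
print(jensen_shannon(hist_a, hist_b))
```

Because the mixture m is positive wherever p or q is, the logarithm is always defined, which is one reason to prefer this measure over the plain Kullback-Leibler divergence.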
Height
Peak
Center
…
Width
Each spectrum: Point in ℜ^500
Find the Fingerprints
Fingerprint of dimension d=100s
ColonCA, n=86
…
Healthy, n=150
Fingerprint of dimension m <= d
Perform manifold-learning for
dimensionality reduction
Each spectrum: Point in ℜ^2
Assumption: informative part of data lies on manifold of lower dimension.
Find mapping
such that reconstruction error
is small.
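The formulas for the mapping and the reconstruction error did not survive in this transcript. One common way to write the idea, given here as an assumption and not as the slide's own notation, uses an encoding map f, a decoding map g, and n spectra x_1, …, x_n:

```latex
f:\mathbb{R}^{d}\to\mathbb{R}^{m},\qquad
g:\mathbb{R}^{m}\to\mathbb{R}^{d},\qquad m \le d,
\qquad
E(f,g)=\frac{1}{n}\sum_{i=1}^{n}\bigl\lVert x_i - g\bigl(f(x_i)\bigr)\bigr\rVert^{2}
```

A small E means that the m-dimensional fingerprint f(x_i) retains most of the information in the d-dimensional one.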
Mass Spectrum Representation
Each 1d spectrum: Point in ℜ^10,000
Peak Finding & Clustering (spectra of same group, e.g. healthy) to filter out noise
Each spectrum: Point in ℜ^3000
Feature selection (Disease A, Healthy)
Each spectrum: Point in ℜ^10 (values are peak intensities)
Mass Data Processing
+ the above within Grid or Cluster Systems (~240 CPU Cores)
It works
• Biomarker identification
• Fiedler, TC et al. (2009) Serum Peptidome Profiling Revealed Platelet Factor 4 as a Potential Discriminating Peptide Associated With Pancreatic Cancer. Clinical Cancer Research.
• Strenziok, TC et al. (2009) Serum proteomic profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry in testicular germ cell cancer patients. World J of Urology.
Preliminary results: fingerprints for 5 different cancer types, two of which have proven successful in first clinical studies.
De novo peptide identification
Transform MS/MS spectrum into spectrum graph
− generate a node for each peak
− connect nodes by directed edge if their mass difference equals the mass of some amino acid
− connect nodes by undirected edges if they correspond to complementary ions
− set of directed edges (E) and undirected edges (H)
− paths correspond to peptides
− search for (longest) antisymmetric paths (NP-hard)
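The graph construction described above can be sketched in a few lines. This is a simplified illustration: peaks are treated as neutral fragment masses, two peaks count as complementary when their masses sum to the total residue mass (real ions need proton and water offsets), and only four amino acids are included. The function name and the toy spectrum for the peptide G-A-S are hypothetical.

```python
# Monoisotopic residue masses in Da (illustrative subset of the 20 amino acids).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}

def build_spectrum_graph(peaks, total_mass, tol=0.02):
    """Return (directed, undirected) edges over the sorted peak list.

    directed:   (i, k, residue) if mass[k] - mass[i] matches a residue mass
    undirected: (i, k) if mass[i] + mass[k] matches total_mass (complementary)
    """
    peaks = sorted(peaks)
    directed, undirected = [], []
    for i, mass_i in enumerate(peaks):
        for k in range(i + 1, len(peaks)):
            diff = peaks[k] - mass_i
            for residue, residue_mass in RESIDUE_MASS.items():
                if abs(diff - residue_mass) <= tol:
                    directed.append((i, k, residue))
            if abs(mass_i + peaks[k] - total_mass) <= tol:
                undirected.append((i, k))
    return directed, undirected

# Toy spectrum of G-A-S: prefix masses, suffix masses, plus 0 and the total.
total = 57.02146 + 71.03711 + 87.03203
toy_peaks = [0.0, 57.02146, 128.05857, 87.03203, 158.06914, total]
E, H = build_spectrum_graph(toy_peaks, total)
print(E)
print(H)
```

In this toy graph the path 0 → 1 → 3 → 5 spells G-A-S and uses at most one node from each complementary pair in H, which is the antisymmetry condition.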
De novo peptide identification
Solve the longest antisymmetric path problem by linear optimization.
For each directed edge (i,k), one variable x_ik ∊ {0,1}:
x_ik = 1: edge (i,k) belongs to the path
x_ik = 0: edge (i,k) does not belong to the path
Lagrangian Relaxation
− Relax the antisymmetry constraints and add a penalty term to the objective function
− Solve the dual problem by iteratively solving the relaxed (Lagrange) problem, computing longest paths in a DAG (easy)
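The "easy" subproblem, a longest path in a DAG, can be solved by dynamic programming. The sketch below assumes nodes are numbered in topological order (every edge goes from a lower to a higher index, as in a spectrum graph sorted by mass) and that the edge weights already include whatever penalty terms the current Lagrange multipliers contribute. It does not implement the multiplier updates; the names and the example graph are hypothetical.

```python
def longest_path_dag(num_nodes, edges):
    """Longest path from node 0 to node num_nodes - 1.

    edges: list of (i, k, weight) with i < k (topological numbering).
    Returns (total_weight, path) or None if the sink is unreachable.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * num_nodes
    pred = [None] * num_nodes
    best[0] = 0.0
    # Sorting by source index guarantees best[i] is final before it is used.
    for i, k, weight in sorted(edges):
        if best[i] > neg_inf and best[i] + weight > best[k]:
            best[k] = best[i] + weight
            pred[k] = i
    sink = num_nodes - 1
    if best[sink] == neg_inf:
        return None
    path = [sink]
    while pred[path[-1]] is not None:
        path.append(pred[path[-1]])
    return best[sink], path[::-1]

example_edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 3, 1.0), (2, 3, 2.5)]
result = longest_path_dag(4, example_edges)
print(result)
```

Each edge is examined once after sorting, so one solve costs O(|E| log |E|), which is cheap enough to repeat in every iteration of the dual problem.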
From degradomes towards disease-associated proteases
Experiment: Take blood from one donor, and produce mass spectra every x hours.
Yi et al. Journal of Proteome Research 2007, 6, 1768-1781
Generation of serum peptide signatures
abundant blood proteins
A-C-D-E-F-G-H-I-K-L-M-N-P-Q-R-S-T-V-W-Y
+
endoproteases
+ A-C-D-E-F-G-H-I-K-L-M-N
+exoprotease
A-C-D-E-F-G-H-I-K-L-M-N
A-C-D-E-F-G-H-I-K-L-M
A-C-D-E-F-G-H-I-K-L
A-C-D-E-F-G-H-I-K
A-C-D-E-F-G-H-I
A-C-D-E-F-G-H
+ P-Q-R-S-T-V-W-Y +
+exoprotease
A-C-D-E-F-G-H-I-K-L-M-N
C-D-E-F-G-H-I-K-L-M-N
D-E-F-G-H-I-K-L-M-N
E-F-G-H-I-K-L-M-N
F-G-H-I-K-L-M-N
G-H-I-K-L-M-N
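The ladders above can be reproduced with simple string slicing: an endoprotease cuts inside the chain, and an exoprotease then removes one residue at a time from one end. This is an illustrative sketch; the function names and the cut position are hypothetical.

```python
def endo_cleave(protein, position):
    """Cut a chain into two fragments after the given residue count."""
    return protein[:position], protein[position:]

def c_terminal_ladder(peptide, steps):
    """Fragments left after removing 0..steps residues from the C-terminus."""
    steps = min(steps, len(peptide) - 1)
    return [peptide[: len(peptide) - i] for i in range(steps + 1)]

def n_terminal_ladder(peptide, steps):
    """Fragments left after removing 0..steps residues from the N-terminus."""
    steps = min(steps, len(peptide) - 1)
    return [peptide[i:] for i in range(steps + 1)]

protein = "ACDEFGHIKLMNPQRSTVWY"
left, right = endo_cleave(protein, 12)
print(left, right)
print(c_terminal_ladder(left, 5))
print(n_terminal_ladder(left, 5))
```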
Degradation Graph
Transform multiple spectra into a degradation graph using
• known precursor peptides
• MS2 identifications
Model generation
dC_1/dt = -k_{1,2} C_1 - k_{1,34} C_1
dC_2/dt = k_{1,2} C_1
dC_3/dt = k_{1,34} C_1
dC_4/dt = k_{1,34} C_1 - k_{4,5} C_4
dC_5/dt = k_{4,5} C_4
Generate mathematical models based on the graph to
• Estimate reaction parameters
• classify diseased/healthy samples
• ...
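To see how such a model behaves, the five rate equations can be integrated numerically. The sketch below uses a plain forward-Euler scheme and reads the fourth equation as depending on C1 (the subscript is missing in the transcript). The rate constants, step size, and function name are made-up assumptions.

```python
def simulate_degradation(k12, k134, k45, c1_start=1.0, dt=0.001, t_end=10.0):
    """Forward-Euler integration of the five-species degradation model.

    Peptide 1 decays to 2 (rate k12) or is cut into 3 and 4 (rate k134);
    peptide 4 decays further to 5 (rate k45). Returns [C1, ..., C5] at t_end.
    """
    c1, c2, c3, c4, c5 = c1_start, 0.0, 0.0, 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        d1 = -k12 * c1 - k134 * c1
        d2 = k12 * c1
        d3 = k134 * c1
        d4 = k134 * c1 - k45 * c4
        d5 = k45 * c4
        c1, c2, c3, c4, c5 = (
            c1 + dt * d1, c2 + dt * d2, c3 + dt * d3, c4 + dt * d4, c5 + dt * d5
        )
    return [c1, c2, c3, c4, c5]

# Made-up rate constants, in units of 1/time.
final = simulate_degradation(k12=0.5, k134=0.3, k45=0.2)
print([round(c, 4) for c in final])
```

Estimating the reaction parameters then amounts to choosing the rate constants so that simulated concentrations match the peak intensities measured over time.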
Glueing it together
Genomics & Proteomics = BioPrints
+
(and even other infos)
Extensions
BioPrints
AIM: given bio-material (fluids, DNA, skin, …) of ONE individual, determine a multi-dimensional fingerprint and relate it to particular diseases
• Fingerprinting based
• Disease based
• Integration of different data sources
• How to link?
Acknowledgements
BioInformatics @ Free University Berlin
AG BioComputing
AG Algorithmic Bioinformatics
Prof. Christof Schütte
Prof. Knut Reinert
AG Comp. Proteomics
Stephan Aiche, Sandro Andreotti,
Sharon Bruckner, Pooja Gupta, Axel Rack
ILM / University Hospital Leipzig
Charité Berlin
Prof. Joachim Thiery
Microsoft Research, Cambridge
Dr. Andre Hagehülsmann
IBM Deutschland
More information online at
msproteomics.net
Thank you very much!
Tim Conrad ([email protected])
AG Computational Proteomics
www.msproteomics.net
Summary
Further questions