Chair of Computational Biology

Bioinformatics III (“Systems biology”)
Course will address two areas:
- analysis and comparison of whole genome sequences
- „systems biology“ – integrated view of cellular networks
1. Lecture WS 2003/04
Bioinformatics III
1
Whole Genomes - Content
(some topics were already covered in the Bioinformatics I lecture by Prof. Lenhof)
genome assembly
gene finding
genome alignment
whole genome comparison (prokaryotes, human vs. mouse)
genome rearrangements
transcriptional regulation
functional genomics
phylogeny
single nucleotide polymorphisms (SNPs)
Cellular Networks - Content
network topologies: random networks, scale free networks
robustness of networks
expression analysis
metabolic networks, metabolic flow analysis
linear systems, non-linear dynamics
molecular systems biology: protein-protein interaction networks
molecular machines ...
Literature
whole genome sequences
e.g. David Mount, Bioinformatics
Chapters 6, 8, 10
€ 68
systems biology
mostly taken from original literature
Web resources
- Institute for Systems Biology, Seattle, WA: http://www.systemsbiology.org/
- The Systems Biology Institute: http://www.systems-biology.org/
assignments
12 weekly assignments are planned.
Homework is handed out in the Tuesday lectures and is available on
our web server http://gepard.bioinformatik.uni-saarland.de on the same day.
Solutions must be returned by Tuesday of the following week, 14:00,
in room 1.05, Building 17.1, first floor, or handed in before (!) the lecture, which starts
at 14:15.
In case of illness please send an e-mail to
[email protected] and provide a medical certificate.
Schein = successful written exam
Successful participation in the lecture course („Schein“, the course certificate) will be certified upon
successful completion of the written exam on Feb. 18, 2004.
Participation in the exam is open to those students who have received at least 50% of
the credit points for the 12 assignments.
Unless published otherwise on the course website by Feb. 4, the exam will be
based on all material covered in the lectures and in the assignments.
In case of illness please send an e-mail to
[email protected] and provide a medical certificate.
A „second and final chance“ exam may be offered at the beginning of
April 2004 to those who failed the first exam and those who missed the first exam
due to illness (medical certificate required).
tutors
Prof. Dr. Volkhard Helms
Office hours: Tue 10-12, Building 17.1, room 1.06.
Generally, I am also available after the lectures.
Dr. Tihamer Geyer: assignments for the network part
Building 17.1, room 1.09.
guest lecturers and tutors
Tree of Life
[Figure: phylogenetic tree of life with the three domains and their labelled lineages]
Bacteria: Purple bacteria, Gram-positive bacteria, Cyanobacteria, Chlamydiae, Flavobacteria, Spirochetes, Thermotogales, Deinococci, Green nonsulfur bacteria, Aquifex
Archaea: Euryarchaeota (Methanosarcina, Methanobacterium, Methanococcus, Thermococcus, Thermoplasma, Halophiles), Crenarchaeota (Thermoproteus, Pyrodictium)
Eukarya: Animals, Fungi, Plants, Slime molds, Entamoebae, Ciliates, Stramenopiles, Trichomonads, Microsporidia, Diplomonads
Genomes
A genome is the entire genetic material of any of these organisms.
We will review genome organization, known sequences, genome language,
sequencing details etc. in the next lecture.
Now that we have genome information from multiple organisms
I see the following issues:
1 what biological questions do we ask?
2 what bioinformatics tools do we need to find the answers?
3 what are the answers?
Why mouse?
19 mouse chromosomes.
Geneticists have anxiously awaited the recently
published draft version of the mouse genome. Why?
The mouse, as a close relative of humans, is a unique lens
through which we can view ourselves.
As the leading mammalian system for genetic
research over the past century it has provided a
model for human physiology and disease.
Comparative genomics makes it possible to discern
biological features that would otherwise escape our
notice.
Nature 420, 520 (2002)
How do we compare genomes?
Conservation of synteny between human and mouse. 558,000 highly
conserved, reciprocally unique landmarks were detected within the mouse and
human genomes, which can be joined into conserved syntenic segments and
blocks. A typical 510-kb segment of mouse chromosome 12 that shares common
ancestry with a 600-kb section of human chromosome 14 is shown. Blue lines
connect the reciprocal unique matches in the two genomes. In general, the
landmarks in the mouse genome are more closely spaced, reflecting the 14%
smaller overall genome size.
Nature 420, 520 (2002)
Genome rearrangements
Segments and blocks >300 kb in size with conserved synteny in human are
superimposed on the mouse genome. Each colour corresponds to a particular
human chromosome. The 342 segments are separated from each other by thin,
white lines within the 217 blocks of consistent colour.
Genome rearrangements have functional implications (will be discussed later).
Nature 420, 520 (2002)
Review: Pairwise sequence alignment
• dynamic programming: Needleman-Wunsch, Smith-Waterman
• sequence alignments
• substitution matrices
• significance of alignments
• BLAST: algorithm, parameters, output (http://www.ncbi.nih.gov)
This part of the lecture is taken from the O’Reilly book on “BLAST” by Korf, Yandell, Bedell;
see also the Bioinformatik I lecture by Prof. Lenhof, weeks 3 and 5.
Sequence alignment
• When 2 or more sequences are present one would like
- to detect quantitatively their similarities
- discover equivalences of single sequence motifs
- observe regularities of conservation and variability
- deduce historical relationships
- important goal: annotation of structural and functional properties
assumption: sequence, structure, and function are inter-related.
Search in databases
• Identify similarities between
- a new test sequence of unknown and uncharacterized structure and function, and
- sequences in (public) sequence databases with known structure and function.
• N.B. The similar regions can encompass the entire sequence or parts of it!
• Local alignment vs. global alignment
Sequence Alignment
The purpose of a sequence alignment is to arrange beneath each other all those
residues of an arbitrary number of sequences that are derived from
the same residue position in an ancestral gene or protein.
gap = insertion or deletion
J. Leunissen
Needleman-Wunsch Algorithm
- general algorithm for sequence comparison
- maximises a similarity score
- maximum match = largest number of residues of one sequence that can be
matched with another allowing for all possible deletions
- finds the best GLOBAL alignment of any two sequences
- NW involves an iterative matrix method of calculation
all possible pairs of residues (bases or amino acids) – one from each
sequence – are represented in a two-dimensional array
all possible alignments (comparisons) are represented by pathways
through this array.
Three main steps: (1) initialization, (2) fill (induction), (3) trace-back
Needleman-Wunsch Algorithm: Initialization
task: align the words “COELACANTH” and “PELICAN” of length m = 10 and n = 7.
Construct an (m+1) × (n+1) matrix.
Assign the value k × gap to the k-th element of the first row and of the
first column. Here, gap = -1, so both borders read 0, -1, -2, ...
Arrows of these fields point back to the origin.
        C   O   E   L   A   C   A   N   T   H
    0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
P  -1
E  -2
L  -3
I  -4
C  -5
A  -6
N  -7
Needleman-Wunsch Algorithm: Fill
Fill all matrix fields with scores and pointers using a simple operation that requires
the scores from the diagonal, vertical, and horizontal neighboring cells. Compute
- match score: value of upper left diagonal cell + score for a match (+1) or mismatch (-1)
- horizontal gap score: value of cell to the left + gap score (-1)
- vertical gap score: value of cell to the top + gap score (-1)
Assign the maximum of these 3 scores to the cell and point the arrow in the direction of the
maximum score. If several directions tie, make an arbitrary but consistent choice, e.g. always
choose the diagonal over a gap.
Examples in row P: cell (P, C) = max(-1, -2, -2) = -1; cell (P, O) = max(-2, -2, -3) = -2
        C   O   E   L   A   C   A   N   T   H
    0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
P  -1  -1  -2
Needleman-Wunsch Algorithm: Trace-back
trace-back lets you recover the alignment from the matrix.
start at the bottom-right corner and follow the arrows until you get to the
beginning.
        C   O   E   L   A   C   A   N   T   H
    0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
P  -1  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
E  -2  -2  -2  -1  -2  -3  -4  -5  -6  -7  -8
L  -3  -3  -3  -2   0  -1  -2  -3  -4  -5  -6
I  -4  -4  -4  -3  -1  -1  -2  -3  -4  -5  -6
C  -5  -3  -4  -4  -2  -2   0  -1  -2  -3  -4
A  -6  -4  -4  -5  -3  -1  -1   1   0  -1  -2
N  -7  -5  -5  -5  -4  -2  -2   0   2   1   0
Resulting alignment:
COELACANTH
-PELICAN--
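The three steps (initialization, fill, trace-back) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `needleman_wunsch` and its parameter names are chosen here, the scores are the ones used on the slides (match +1, mismatch -1, gap -1), and ties are resolved by preferring the diagonal.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)

    # 1. Initialization: the borders hold multiples of the gap score.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap

    # 2. Fill: each cell is the maximum over diagonal, vertical and horizontal moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # 3. Trace-back from the last cell to the origin, diagonal first on ties.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")  # gap in b
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])  # gap in a
            j -= 1
    return F[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("COELACANTH", "PELICAN")
print(score)
print(top)
print(bottom)
```

With the slide's two words this reproduces the alignment COELACANTH / -PELICAN-- with a total score of 0, the value in the bottom-right cell of the matrix.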
Smith-Waterman-Algorithm
Smith-Waterman is a local alignment algorithm. SW is a very simple
modification of Needleman-Wunsch. Only 3 changes:
- edges of the matrix are initialized to 0 instead of increasing gap penalties.
- maximum score is never less than 0. No pointer is recorded unless the score
is greater than 0.
- trace-back starts from highest score in matrix and ends at a score of 0.
        C   O   E   L   A   C   A   N   T   H
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
E   0   0   0   1   0   0   0   0   0   0   0
L   0   0   0   0   2   1   0   0   0   0   0
I   0   0   0   0   1   1   0   0   0   0   0
C   0   1   0   0   0   0   2   0   0   0   0
A   0   0   0   0   0   1   0   3   2   1   0
N   0   0   0   0   0   0   0   1   4   3   2
Best local alignment:
ELACAN
ELICAN
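The three changes relative to Needleman-Wunsch can be made concrete in a short Python sketch. The function name `smith_waterman` and the parameter names are illustrative choices; the scores are again match +1, mismatch -1, gap -1, and ties in the trace-back prefer the diagonal.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)
    # Change 1: the edges stay 0 instead of accumulating gap penalties.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Change 2: a cell score is never allowed to drop below 0.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Change 3: trace back from the highest score until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("COELACANTH", "PELICAN"))
```

For the two example words the highest cell has score 4 and the trace-back yields ELACAN aligned with ELICAN.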
Differences
Needleman-Wunsch
1 Global alignments
2 Requires the alignment score for a pair of residues to be ≥ 0
3 No gap penalty required
Smith-Waterman
1 Local alignments
2 Residue alignment score may be positive or negative
3 Requires a gap penalty to work efficiently
more suited for alignment of eukaryotic sequences with exons and introns
Algorithmic complexity
Dynamic programming methods such as Needleman-Wunsch and Smith-Waterman have O(mn) complexity in both time and memory.
Variation: if only the score is needed, use just 2 rows at a time and don’t allocate the whole matrix.
The score computation then becomes O(n) in memory, while the time stays O(mn).
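A minimal sketch of the two-row variation, assuming the same scores as in the Needleman-Wunsch example (match +1, mismatch -1, gap -1). The function name `nw_score_two_rows` is hypothetical. Note that this version returns only the optimal score: without the full matrix there are no pointers left for a trace-back.

```python
def nw_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using only two rows of the matrix at a time."""
    # Row 0 of the matrix: increasing gap penalties.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # The first cell of each row belongs to the initialized first column.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(nw_score_two_rows("COELACANTH", "PELICAN"))
```

Memory use is proportional to the length of the second sequence, so it pays to pass the shorter sequence as `b`.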
Scoring or Substitution Matrices
– serve to better score the quality of sequence alignments.
– for protein/protein comparison:
a 20 x 20 matrix for the probabilities that certain amino acids are
exchanged for others by random mutations
– the exchange of amino acids of similar character (Ile, Leu) is more
likely (receives a higher score) than the exchange of amino acids of
dissimilar character (e.g. Ile → Asp)
– scoring matrices are assumed to be symmetrical (the exchange Ile → Asp
has the same probability as Asp → Ile). Therefore it suffices to store them as
triangular matrices.
Substitution matrices
• Not all amino acids are similar
– some can be replaced more easily than others
– some mutations occur more frequently than others
– some mutations are more long-lived than others
• Mutations prefer certain exchanges
– some amino acids have similar 3-letter codons
– those residues are more often replaced by random DNA mutation
• Selection prefers certain exchanges
– some amino acids have similar properties and structure
(E.g. Trp cannot be inserted in the protein interior.)
PAM250 Matrix
Example Score
The score of an alignment is the sum of all individual scores of the
amino acid (base) pairs of the alignment.
Sequence 1:  T  C  C  P  S  I  V  A  R  S  N
Sequence 2:  S  C  C  P  S  I  S  A  R  N  T
Scores:      1 12 12  6  2  5 -1  2  6  1  0
=> Alignment score = 46
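The summation can be checked with a few lines of Python. The per-column scores are copied from the slide; the lookup table built from them is only an excerpt covering the pairs that occur in this example, not a full substitution matrix, and the names `pair_score` and `alignment_score` are illustrative.

```python
seq1 = "TCCPSIVARSN"
seq2 = "SCCPSISARNT"
column_scores = [1, 12, 12, 6, 2, 5, -1, 2, 6, 1, 0]  # taken from the slide

# Excerpt of a symmetric scoring table: an unordered residue pair maps to its score.
pair_score = {}
for x, y, s in zip(seq1, seq2, column_scores):
    pair_score[frozenset((x, y))] = s


def alignment_score(a, b, table):
    """Sum the scores of all aligned residue pairs (ungapped alignment)."""
    return sum(table[frozenset((x, y))] for x, y in zip(a, b))


print(alignment_score(seq1, seq2, pair_score))
```

Using `frozenset` keys makes the lookup independent of the order of the two residues, which mirrors the symmetry assumption for scoring matrices.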
Dayhoff Matrix (1)
– derived by M.O. Dayhoff who collected statistical data for probabilities of
amino acid exchanges
– data set for closely related protein sequences (> 85% identity).
– advantage: these can be aligned to high certainty.
– derive 20 x 20 matrix for probabilities of amino acid mutations from the
observed frequency of exchanges
– This matrix is called PAM 1. An evolutionary distance of 1 PAM (point
accepted mutation) means that 1 point mutation occurs per 100 residues.
Or: both sequences are 99% identical.
Dayhoff Matrix (2)
• Log odds matrix: contains the logarithms of the PAM matrix entries.
• Score of mutation i → j = log( observed mutation rate i → j / expected mutation rate according to amino acid frequency )
• The probability of two independent mutational events is the product of
the individual probabilities.
• When using a log odds matrix (i.e. using the logarithm of all values) one
obtains the total alignment score as the sum of the scores for every residue
pair.
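Why logarithms turn products into sums can be shown with a tiny numerical example. The observed and expected rates below are invented for illustration, and base-10 logarithms are an assumption (the slide does not state the base).

```python
import math


def log_odds(observed, expected):
    """Log-odds score: logarithm of the observed over the expected rate."""
    return math.log10(observed / expected)


# Hypothetical (observed, expected) rates for three aligned residue pairs.
rates = [(0.02, 0.01), (0.005, 0.01), (0.04, 0.01)]

# Summing the log-odds scores ...
total_score = sum(log_odds(o, e) for o, e in rates)
# ... gives the logarithm of the product of the individual odds ratios.
product_of_odds = math.prod(o / e for o, e in rates)

print(total_score, math.log10(product_of_odds))
```

A pair that is observed more often than expected gets a positive score, a pair observed less often than expected gets a negative one.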
Dayhoff Matrix (3)
• Derive matrices for larger evolutionary distances by multiplying the
PAM1 matrix with itself.
• PAM250:
– 2.5 mutations per residue
– corresponds to 20% matches between two sequences,
– i.e. mutations are observed at 80% of all residue positions.
– This is the default matrix of most sequence analysis packages.
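The repeated multiplication can be illustrated with a toy model. The matrix below is NOT the Dayhoff PAM1 matrix: it is a made-up uniform model in which every residue changes with probability 0.01 per step and all 19 replacements are equally likely. All names are hypothetical.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(A, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = A
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


N = 20           # number of amino acids
p_change = 0.01  # toy assumption: 1 change per 100 residues and step
toy_pam1 = [[1 - p_change if i == j else p_change / (N - 1) for j in range(N)]
            for i in range(N)]

toy_pam250 = mat_pow(toy_pam1, 250)
identity_after_250 = toy_pam250[0][0]  # probability that a residue is unchanged
print(round(identity_after_250, 3))
```

Even after 2.5 changes per residue on average, a sizeable fraction of positions still matches, because changes hit the same position repeatedly and can revert. In this uniform toy model about 12% of the positions remain identical; the 20% quoted on the slide refers to the real, non-uniform Dayhoff matrix.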
BLOSUM Matrix
• Limitation of the Dayhoff matrix:
the matrices based on the Dayhoff model of evolutionary rates are of limited
value because the substitution rates were derived from alignments
of sequences that are more than 85% identical.
• A different path was taken by S. Henikoff and J.G. Henikoff, who used local
multiple alignments of distantly related sequences.
• Advantages:
- larger data sets
- multiple alignments are more robust
BLOSUM Matrix (2)
• The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS
database.
• The BLOCKS database uses the concept of blocks (ungapped amino acid
signatures) that are characteristic for protein families.
• Derive probabilities of exchange for all amino acid pairs from the observed
mutations inside the blocks. Convert into log odds BLOSUM matrix.
• Different matrices are obtained by varying the lower requirement for the level of
sequence identity.
• e.g. the BLOSUM80 matrix is derived from blocks with > 80% identity.
Which matrix to use?
• Close relationship: low PAM, high BLOSUM
• Distant relationship: high PAM, low BLOSUM
• Reasonable default parameters: PAM250, BLOSUM62
Gap penalties
• Besides substitution matrices we need a method to score gaps.
• What relevance do insertions or deletions have relative to substitutions?
• Distinguish the introduction of a gap:
aaagaaa
aaa-aaa
from the extension of a gap:
aaaggggaaa
aaa----aaa
• Different programs (CLUSTAL-W, BLAST, FASTA) recommend different default
parameters, which should be used as a first guess.
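One common way to distinguish gap introduction from gap extension is an affine gap penalty. The sketch below is only one possible convention (the first gap position costs the opening penalty, every further position the extension penalty), and the default values -10 and -1 are placeholders, not the defaults of any particular program.

```python
def affine_gap_penalty(length, gap_open=-10, gap_extend=-1):
    """Penalty for a single gap of the given length under an affine scheme."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One gap of length 4 (aaa----aaa) versus four separate gaps of length 1.
print(affine_gap_penalty(4), 4 * affine_gap_penalty(1))
```

With such a scheme one long gap is much cheaper than several short ones, which reflects that a single insertion or deletion event can cover several residues.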
Significance of Alignments (1)
• When is an alignment statistically significant?
In other words:
• How different is the obtained score of an alignment from scores that
would result from alignments of the test sequence with random
sequences?
Or:
• What is the probability that an alignment of this score occurred
randomly?
Significance of Alignments (2)
size of database = 20 × 10^6 letters
Peptide    #hits
A          1 × 10^6 (if equally distributed)
AP         50000
IAP        2500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015
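The numbers in this table follow from a simple estimate: if all 20 amino acids are equally frequent, each additional residue in the peptide divides the expected number of random hits by 20. A short sketch, with the hypothetical function name `expected_hits`:

```python
def expected_hits(database_letters, peptide_length, alphabet_size=20):
    """Expected number of random exact matches, assuming equally frequent letters."""
    return database_letters / alphabet_size ** peptide_length


database_letters = 20_000_000  # 20 x 10^6 letters, as on the slide
for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
    print(peptide, expected_hits(database_letters, len(peptide)))
```

The unrounded values for the three longest peptides are 6.25, 0.3125 and 0.015625.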
BLAST: Basic Local Alignment Search Tool
• finds the highest-scoring locally optimal alignments of a test sequence with all
sequences of a database.
• Very fast algorithm: about 50 times faster than dynamic programming.
• Because BLAST uses a pre-indexed database, it can be used to search
very large databases.
• Is sufficiently sensitive and selective for most purposes.
• Is robust: default parameters usually work fine.
BLAST Algorithm, Step 1
• For a given word length w (usually 3 for proteins) and a given scoring matrix,
construct the list of all words (w-mers) that score at least T when compared to a w-mer of
the input sequence.
test sequence: LNKCKTPQGQRLVNQ
word: PQG, cut-off T = 13
related words:
PQG 18
PEG 15
PRG 14
PKG 14
PNG 13
PDG 13
PMG 13
below cut-off:
PQA 12
PQN 12
etc.
BLAST Algorithm, Step 2
• Each related word points to positions in the database (hit list).
PQG 18
PEG 15
PRG 14
PKG 14
PNG 13
PDG 13
PMG 13
[Figure: arrows from the related words, e.g. PMG, to their positions in the database]
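The idea of a hit list can be sketched as a dictionary from words to database positions. This is a strongly simplified illustration and not how the BLAST program is implemented; the two database sequences are invented, and the function names are hypothetical. The related words are the ones listed on the slide.

```python
from collections import defaultdict


def build_word_index(database, w=3):
    """Map every w-mer to the list of (sequence name, offset) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def find_hits(related_words, index):
    """Return the database positions for every related word that occurs at all."""
    return {word: index[word] for word in related_words if word in index}


# Hypothetical toy database with two short protein sequences.
toy_database = {"seq1": "MKPMGLLA", "seq2": "AAPQGPEGW"}
related_words = ["PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PMG"]

hits = find_hits(related_words, build_word_index(toy_database))
print(hits)
```

Because the index is built once for the whole database, looking up a word does not require scanning the sequences again.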
BLAST Algorithm, Step 3
• The program tries to extend suitable segments (seeds) in both directions by adding
pairs of residues.
• Residues are added until the score drops below a cut-off.
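A minimal sketch of an ungapped seed extension. It is an illustration under simplifying assumptions: scores are +1 for identical and -1 for different residues instead of a substitution matrix, and the stopping rule is that the running score may fall at most `max_drop` below the best score seen so far. The function name and all parameters are hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, w=3,
                match=1, mismatch=-1, max_drop=2):
    """Extend a seed of length w without gaps; returns (score, query part, subject part)."""
    def pair_score(x, y):
        return match if x == y else mismatch

    best = sum(pair_score(query[q_start + k], subject[s_start + k]) for k in range(w))
    q_end, s_end = q_start + w, s_start + w  # exclusive ends of the seed

    # Extend to the right, remembering the extension length with the best score.
    right, running, k = 0, best, 0
    while q_end + k < len(query) and s_end + k < len(subject):
        running += pair_score(query[q_end + k], subject[s_end + k])
        k += 1
        if running > best:
            best, right = running, k
        elif best - running > max_drop:
            break

    # Extend to the left in the same way.
    left, running, k = 0, best, 0
    while q_start - k - 1 >= 0 and s_start - k - 1 >= 0:
        running += pair_score(query[q_start - k - 1], subject[s_start - k - 1])
        k += 1
        if running > best:
            best, left = running, k
        elif best - running > max_drop:
            break

    return (best,
            query[q_start - left:q_end + right],
            subject[s_start - left:s_end + right])


# The seed "CAN" occurs at offset 5 in COELACANTH and at offset 4 in PELICAN.
print(extend_seed("COELACANTH", "PELICAN", 5, 4))
```

Starting from the common word CAN, the extension recovers the segment pair ELACAN / ELICAN with score 4.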
different BLAST algorithms
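• BLASTN: compares a nucleotide sequence against a nucleotide database
• BLASTP: compares a protein sequence against a protein database
• BLASTX: compares a nucleotide sequence translated in all 6 reading
frames against a protein sequence database
• TBLASTN: compares a protein sequence against a nucleotide database translated in all 6 reading frames
• TBLASTX: compares the 6-frame translations of a nucleotide sequence against the 6-frame translations of a nucleotide database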
BLAST Output (1)
[Figure: BLAST output listing. Annotation: a small probability shows that the hit is likely not random]
Significance of BLAST alignment
P-value (probability)
– probability that alignment score could result from alignment of random
sequences
– the closer P is to 0, the higher the certainty that a hit is a true hit
(homologous sequence)
E-value (expectation value)
– E = P * number of sequences in database
– E is the number of alignments of a particular score that can be
expected to occur randomly in a sequence database of this size
– if e.g. E=10, one expects 10 random hits with the same score. Such an
alignment is not significant. Use appropriate threshold in BLAST.
Rough guide
• P-value (probability), after A. M. Lesk
– P ≤ 10^-100: sequences are identical
– 10^-100 < P < 10^-50: sequences are almost identical, e.g. alleles or SNPs
– 10^-50 < P < 10^-10: closely related sequences, homology is certain
– 10^-10 < P < 10^-1: sequences are usually distantly related
– P > 10^-1: similarity probably not significant
• E-value (expectation value)
– E ≤ 0.02: sequences probably homologous
– 0.02 < E < 1: homology possible
– E ≥ 1: good agreement most likely random
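The relation E = P × (number of sequences in the database) and the E-value part of the rough guide can be written as two small functions. The sketch reads the first threshold as E ≤ 0.02 and the last one as E ≥ 1; the function names are hypothetical and the guide itself is only a rule of thumb.

```python
def e_value(p_value, n_sequences):
    """Expected number of random hits with this score in a database of n sequences."""
    return p_value * n_sequences


def interpret_e_value(e):
    """Rough interpretation of an E-value following the guide above."""
    if e <= 0.02:
        return "sequences probably homologous"
    if e < 1:
        return "homology possible"
    return "good agreement most likely random"


# Hypothetical example: P = 1e-8 in a database of one million sequences.
e = e_value(1e-8, 1_000_000)
print(e, interpret_e_value(e))
```

The same P-value becomes less convincing as the database grows, because more random hits are expected.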
Rough guide
Level of sequence identity with optimal alignment
> 45%: proteins have very similar structure and most likely the same function
> 25%: proteins probably possess similar fold
18 – 25%: Twilight-Zone, assuming homology is tempting
below: alignment has little significance
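The level of sequence identity can be computed directly from an alignment. The sketch below divides the number of identical columns by the total alignment length including gap columns; this is an assumption, since other conventions (e.g. dividing by the length of the shorter sequence) are also in use. The function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns that contain two identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != "-")
    return 100.0 * identical / len(aligned_a)


print(percent_identity("COELACANTH", "-PELICAN--"))
```

For the global alignment of the two example words, 5 of 10 columns are identical.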
Twilight-Zone (1)
• myoglobin from whale and Leghemoglobin of lupins are 15% identical with
optimal alignment
• both have very similar tertiary structure.
• both contain heme group and bind oxygen
• they are remotely related but nevertheless homologous proteins
[Figure: left, whale Mb; right, Leg Hb. Source: www.rcsb.org]
Twilight-Zone (2)
• the N- and C-terminal halves of thiosulfate sulfurtransferase (rhodanese) have 11% sequence
identity.
• Because they belong to the same protein, one assumes that they resulted from gene
duplication and divergent evolution.
• Indeed, both 3D structures show large similarity.
[Figure: rhodanese, PDB entry 2ORA. Source: www.rcsb.org]
Twilight-Zone (3)
• serine proteases chymotrypsin and subtilisin
• have 12% identity with optimal alignment
• both have same function, same catalytic triad of 3 amino acids (Ser – His – Asp)
• However, the two folds are completely different and the proteins are not related.
Example for convergent evolution.
[Figure: left, bovine chymotrypsin (1AB9); right, Bacillus lentus subtilisin (1GCI). Source: www.rcsb.org]
Summary
Pairwise alignment of sequences is routine but not trivial.
Dynamic programming guarantees finding the alignment with optimal score
(Smith-Waterman, Needleman-Wunsch).
Much faster, yet still reliable, tools are FASTA and (PSI-)BLAST.
Deeper functional insight into sequences and their relationships can be gained from
multiple sequence alignments (see lecture on phylogenies).
Growth of Proteomic Data vs. Sequence Data
[Figure: growth of proteomic data vs. GenBank sequence data, 1988 to 2016. y-axis: PetaBytes, logarithmic scale from 0.00001 to 1000; x-axis: Years]
Systems biology
Systems biology is an emergent field that aims at system-level understanding
of biological systems.
Cybernetics, for example, aims at describing animals and machines from the
control and communication theory. Unfortunately, molecular biology had just
started at that time, so that only phenomenological analysis has been possible.
With the progress of genome sequence project and range of other molecular
biology project that accumulate in-depth knowledge of molecular nature of
biological system, we are now at the stage to seriously look into possibility of
system-level understanding solidly grounded on molecular-level
understanding.
http://www.systems-biology.org/000/
Systems biology
What does it mean to understand at "system level"?
Unlike molecular biology which focusses on molecules, such as the sequences
of nucleotide acids and proteins, systems biology focusses on systems that
are composed of molecular components.
Although systems are composed of matters, the essence of systems lies in the
dynamics and cannot be described merely by enumerating components of the
system.
At the same time, it is misleading to believe that only the system structure,
such as network topologies, is important without paying sufficient attention to
diversities and functionalities of components.
Both the structure of the system and its components play indispensable roles
forming a symbiotic state of the system as a whole.
http://www.systems-biology.org/000/
Systems biology
Key milestones are:
(1) understanding of structure of the system, such as gene regulatory and
biochemical networks, as well as physical structures,
(2) understanding of dynamics of the system, both quantitative and qualitative
analysis as well as construction of theory/models with powerful prediction
capability,
(3) understanding of control methods of the system, and
(4) understanding of design methods of the system.
There are numbers of exciting and profound issues that are actively
investigated, such as robustness of biological systems, network structures and
dynamics, and applications to drug discovery. Systems biology is in its infancy,
but this is the area that has to be explored and the area that we believe to be
the main stream in biological sciences in this century.
http://www.systems-biology.org/000/
Systems Biology
Nat. Biotech. Nov. 2000, 1147
From Genomics to Genetic Circuits
The relationship between the genotype and the phenotype is complex, highly
non-linear and cannot be predicted from simply cataloging and assigning gene
functions to genes found in a genome.
http://gcrg.ucsd.edu/presentations/hougen/l2.pdf
Genetic Circuits  Engineering
Analysis of Genetic Circuits
Reconstructing Metabolic Networks
Translating Biochemistry into Linear Algebra
DOE initiative: Genomes to Life
a coordinated effort
slides borrowed from a talk by Marvin Frazier, Life Sciences Division, U.S. Dept. of Energy
Facility I
Production and Characterization of Proteins
Estimating Microbial Genome Capability
• Computational Analysis
– Genome analysis of genes, proteins, and operons
– Metabolic pathways analysis from reference data
– Protein machines estimate from PM reference data
• Knowledge Captured
– Initial annotation of genome
– Initial perceptions of pathways and processes
– Recognized machines, function, and homology
– Novel proteins/machines (including
prioritization)
– Production conditions and experience
Facility II
Whole Proteome Analysis
Modeling Proteome Expression, Regulation, and Pathways
• Analysis and Modeling
– Mass spectrometry expression analysis
– Metabolic and regulatory pathway/network analysis and modeling
• Knowledge Captured
– Expression data and conditions
– Novel pathways and processes
– Functional inferences about novel proteins/machines
– Genome super annotation: regulation, function, and processes (deep knowledge about cellular subsystems)
Facility III
Characterization and Imaging of Molecular Machines
Exploring Molecular Machine Geometry and Dynamics
• Computational Analysis, Modeling and Simulation
– Image analysis/cryoelectron microscopy
– Protein interaction analysis/mass spec
– Machine geometry and docking modeling
– Machine biophysical dynamic simulation
• Knowledge Captured
– Machine composition, organization, geometry, assembly and disassembly
– Component docking and dynamic simulations of machines
Facility IV
Analysis and Modeling of Cellular Systems
Simulating Cell and Community Dynamics
• Analysis, Modeling and Simulation
– Couple knowledge of pathways, networks, and machines to generate an understanding of cellular and multi-cellular systems
– Metabolism, regulation, and machine simulation
– Cell and multicell modeling and flux visualization
• Knowledge Captured
– Cell and community measurement data sets
– Protein machine assembly time-course data sets
– Dynamic models and simulations of cell processes
Computing and Information Infrastructure Capabilities
GTL Computing Roadmap
[Figure: computing roadmap plotted against biological complexity. Labels: comparative genomics; genome-scale protein threading; constrained rigid docking; constraint-based flexible docking; protein machine interactions; molecular machine classical simulation; molecule-based cell simulation; cell, pathway, and network simulation; community metabolic, regulatory, signaling simulations. Marker: current U.S. computing. x-axis: Biological Complexity]