Centre for Integrative Bioinformatics VU
Introduction to bioinformatics
Lecture 3
High-throughput Biological Data
- data deluge, bioinformatics algorithms and evolution -
Last lecture:
• Many different genomics datasets:
– Genome sequencing: more than 300 species completely
sequenced and data in public domain (i.e. information
is freely available), virus genome can be sequenced in a
day
– Gene expression (microarray) data: many microarrays
measured per day
– Proteomics: Protein Data Bank (PDB) - as of Tuesday
February 07, 2006 there are 35026 Structures.
http://www.rcsb.org/pdb/
– Protein-protein interaction data: many databases
worldwide
– Metabolic pathway, regulation and signaling data, many
databases worldwide
Growth in number of protein
tertiary structures
The data deluge
Although a lot of tertiary structural data is being
produced (preceding slide), there is the
SEQUENCE-STRUCTURE-FUNCTION GAP
The gap between sequence data on the one hand, and
structure or function data on the other, is widening
rapidly: Sequence data grows much faster
High-throughput Biological Data
The data deluge
• Hidden in all these data classes is
information that reflects
– existence, organization, activity,
functionality …… of biological machineries
at different levels in living organisms
Most effectively utilising and analysing this
information computationally is essential for
Bioinformatics
Data issues: from data to
distributed knowledge
• Data collection: getting the data
• Data representation: data standards, data normalisation …..
• Data organisation and storage: database issues …..
• Data analysis and data mining: discovering “knowledge”,
patterns/signals, from data, establishing associations among
data patterns
• Data utilisation and application: from data patterns/signals to
models for bio-machineries
• Data visualization: viewing complex data ……
• Data transmission: data collection, retrieval, …..
• ……
Bio-Data Analysis and Data Mining
• Analysis and mining tools exist and are developed for:
– DNA sequence assembly
– Genetic map construction
– Sequence comparison and database searching
– Gene finding
– Gene expression data analysis
– Phylogenetic tree analysis, e.g. to infer horizontally transferred genes
– Mass spectrometry data analysis for protein complex
characterization
– ……
Bio-Data Analysis and Data Mining
• As the amount and types of data and their
cross connections increase rapidly
• the number of analysis tools needed will go up
“exponentially” if we do not reuse techniques
– blast, blastp, blastx, blastn, … from BLAST family
of tools (we will cover BLAST later)
– gene finding tools for human, mouse, fly, rice,
cyanobacteria, …..
– tools for finding various signals in genomic
sequences, protein-binding sites, splice junction
sites, translation start sites, …..
Bio-Data Analysis and Data Mining
Many of these data analysis problems are
fundamentally the same problem(s) and can
be solved using the same set of tools
e.g.
•clustering or
•optimal segmentation by Dynamic
Programming
We will cover both of these techniques in later lectures
Bio-data Analysis, Data
Mining and Integrative
Bioinformatics
To have analysis capabilities covering a wide
range of problems, we need to discover the
common fundamental structures of these
problems;
HOWEVER in biology one size does NOT fit all…
An important goal of bioinformatics is
development of a data analysis
infrastructure in support of Genomics and
beyond
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)
SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE (oligomers)
TERTIARY STRUCTURE (fold)
Protein complexes for photosynthesis in plants
Protein folding problem
Each protein sequence “knows”
how to fold into its tertiary
structure. We still do not
understand exactly how and why
PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
1-step
process
2-step
process
TERTIARY STRUCTURE (fold)
The 1-step process is based on a
hydrophobic collapse; the 2-step
process, more common in forming
larger proteins, is called the
framework model of folding
Protein folding: a step on the way
is secondary structure prediction
• Long history -- first widely used algorithm was
by Chou and Fasman (1974)
• Different algorithms have been developed over
the years to crack the problem:
– Statistical approaches
– Neural networks (first from speech recognition)
– K-nearest neighbour algorithms
– Support Vector Machines
Algorithms in bioinformatics
(recap)
• Sometimes the same basic algorithm can be
re-used for different problems (1-method-multiple-problem)
• Normally, biological problems are
approached by different researchers using a
variety of methods (1-problem-multiple-method)
Algorithms in bioinformatics
• string algorithms
• dynamic programming
• machine learning (Neural Networks, k-Nearest Neighbour,
Support Vector Machines, Genetic Algorithms, ..)
• Markov chain models, hidden Markov models, Markov
Chain Monte Carlo (MCMC) algorithms
• molecular mechanics, e.g. molecular dynamics, Monte
Carlo, simplified force fields
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…
Sequence analysis and homology searching
Finding genes and regulatory elements
There are many different regulation signals such as start, stop and skip
messages hidden in the genome for each gene, but what and where are they?
Expression data
Functional genomics
• Monte Carlo
Protein translation
What is life?
• NASA astrobiology program:
“Life is a self-sustained chemical system
capable of undergoing Darwinian
evolution”
Evolution
Four requirements:
• Template structure providing stability (DNA)
• Copying mechanism (DNA replication, passed on via mitosis and meiosis)
• Mechanism providing variation (mutations;
insertions and deletions; crossing-over; etc.)
• Selection: some traits lead to greater fitness of one
individual relative to another. Darwin wrote
“survival of the fittest”
Evolution is a conservative process: the vast majority of mutations
will not be selected (i.e. will not make it as they lead to worse
performance or are even lethal) – this is called negative (or
purifying) selection
Orthology/paralogy
Orthologous genes are homologous
(corresponding) genes in different
species
Paralogous genes are homologous genes
within the same species (genome)
Changing molecular sequences
• Mutations: changing nucleotides (‘letters’)
within DNA, also called ‘point mutations’
• A & G: purines, C & T/U: pyrimidines:
– Transition: purine -> purine or pyrimidine ->
pyrimidine
– Transversion: purine -> pyrimidine or
pyrimidine -> purine
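The transition/transversion rule above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `classify_substitution` is hypothetical and not part of the lecture material.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_substitution(old, new):
    """Return 'transition' or 'transversion' for a single-base change."""
    old, new = old.upper(), new.upper()
    for base in (old, new):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unknown base: {base!r}")
    if old == new:
        raise ValueError("no substitution: both bases are identical")
    # Same chemical class on both sides means a transition,
    # a change of class means a transversion.
    same_class = (old in PURINES) == (new in PURINES)
    return "transition" if same_class else "transversion"
```

For example, `classify_substitution("A", "G")` stays within the purines and is a transition, while `classify_substitution("A", "C")` crosses from purine to pyrimidine and is a transversion.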
Types of point mutation
• Synonymous mutation: mutation that does
not lead to an amino acid change (where in
the codon are these expected?)
• Non-synonymous mutation: does lead to
an amino acid change
– Missense mutation: one a.a replaced by other
a.a
– Nonsense mutation: a.a. replaced by stop
codon (what happens with protein?)
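As a rough sketch of these definitions, the Python below classifies a single-base change within one codon using the standard genetic code. It assumes DNA codons (T rather than U) and a sense (non-stop) starting codon; the names `CODON_TABLE`, `classify_point_mutation` and `synonymous_fraction_by_position` are illustrative and not from the lecture.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_point_mutation(codon, position, new_base):
    """Classify a single-base change at `position` (0, 1 or 2) of a sense codon."""
    before = CODON_TABLE[codon]
    if before == "*":
        raise ValueError("this sketch assumes a sense (non-stop) codon")
    mutant = codon[:position] + new_base + codon[position + 1:]
    after = CODON_TABLE[mutant]
    if after == before:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "missense"


def synonymous_fraction_by_position():
    """Fraction of single-base changes that are synonymous, per codon position."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # only sense codons are considered
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                synonymous[pos] += CODON_TABLE[mutant] == aa
    return [s / t for s, t in zip(synonymous, total)]
```

Calling `synonymous_fraction_by_position()` gives one way to explore the question of where in the codon synonymous mutations are expected: under the standard code the third position has by far the largest synonymous fraction.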
Ka/Ks Ratios
• Ks is defined as the number of synonymous
nucleotide substitutions per synonymous site
• Ka is defined as the number of nonsynonymous
nucleotide substitutions per nonsynonymous site
• The Ka/Ks ratio is used to estimate the type of
selection exerted on a given gene or DNA
fragment
• Aligned orthologous sequences are needed to
calculate Ka/Ks ratios (we will talk about
alignment later).
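To make the definitions of Ka and Ks concrete, here is a deliberately simplified Python sketch, loosely in the spirit of counting methods such as Nei-Gojobori. It is not the method used in the lecture. The assumptions are: two gap-free, in-frame aligned DNA coding sequences; codon pairs that differ at more than one position are skipped; changes to a stop codon count as nonsynonymous; and a Jukes-Cantor correction is applied to the observed proportions. All function names are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Number of synonymous sites in a codon (a value between 0 and 3)."""
    sites = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    sites += 1 / 3  # one of three possible changes at this position
    return sites


def jukes_cantor(p):
    """Correct an observed proportion of differences for multiple hits."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(seq1, seq2):
    """Return (Ka, Ks, Ka/Ks) for two aligned, gap-free coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("need equal-length sequences made of whole codons")
    syn_total = nonsyn_total = syn_diffs = nonsyn_diffs = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        n_diff = sum(a != b for a, b in zip(c1, c2))
        if n_diff > 1:
            continue  # simplification: multi-hit codons are ignored
        s = (syn_sites(c1) + syn_sites(c2)) / 2
        syn_total += s
        nonsyn_total += 3 - s
        if n_diff == 1:
            if CODE[c1] == CODE[c2]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    if syn_diffs == 0:
        raise ValueError("Ks is zero, so the Ka/Ks ratio is undefined")
    ka = jukes_cantor(nonsyn_diffs / nonsyn_total)
    ks = jukes_cantor(syn_diffs / syn_total)
    return ka, ks, ka / ks
```

For two toy sequences such as `"GGTGCTAAA"` and `"GGCGCTAGA"` (one synonymous and one nonsynonymous difference), `ka_ks` returns a ratio below 1. Real analyses use established software and handle multi-hit codons, gaps and codon bias properly.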
Ka/Ks ratios
The frequency of different values of Ka/Ks for 835 mouse–rat
orthologous genes. Figures on the x axis represent the middle figure of
each bin; that is, the 0.05 bin collects data from 0 to 0.1
Ka/Ks ratios
Three types of selection:
1. Negative (purifying) selection -> Ka/Ks < 1
2. Neutral evolution, i.e. no selection (Kimura) -> Ka/Ks ~= 1
3. Positive selection -> Ka/Ks > 1
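The three cases can be written as a small decision rule. In the sketch below, the `tolerance` that decides how close to 1 still counts as "approximately 1" is an arbitrary assumption chosen for illustration, and the function name `selection_type` is hypothetical.

```python
def selection_type(ka, ks, tolerance=0.1):
    """Map a Ka/Ks ratio onto the three categories of selection."""
    if ks <= 0:
        raise ValueError("Ks must be positive for the ratio to be defined")
    ratio = ka / ks
    if abs(ratio - 1) <= tolerance:  # assumed cut-off for 'approximately 1'
        return "neutral"
    return "negative (purifying)" if ratio < 1 else "positive"
```

In practice a statistical test, rather than a fixed tolerance, would be used to decide whether a ratio differs significantly from 1.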
Human Evolution
Divergent Evolution
Ancestral sequence: ABCD
Mutation (B -> C): ACCD
Deletion (C -> ø): ABD
Pairwise alignment of the two descendant sequences:
ACCD
AB─D (true alignment)
or
ACCD
A─BD
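One way to see why the true alignment is hard to recover is to score both candidate alignments of ACCD and ABD with a simple column-wise scheme. The scoring values below (match +1, mismatch 0, gap -1) are assumptions made for this sketch, not values from the lecture.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1, gap_chars="-─"):
    """Sum column scores for two equal-length alignment rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a in gap_chars or b in gap_chars:
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


true_alignment = score_alignment("ACCD", "AB-D")
alternative = score_alignment("ACCD", "A-BD")
```

Under this scheme both alignments contain two matches, one mismatch and one gap, so they receive the same score. The scoring alone cannot tell which one reflects the actual evolutionary history.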
Consequence of evolution
• Notion of comparative analysis (Darwin)
• What you know about one species might be
transferable to another, for example from
mouse to human
• Provides a framework to do the multi-level
large-scale analysis of the genomics data
plethora
Flavodoxin-cheY Multiple Sequence Alignment
Human
Yeast
We need to be able to
do automatic pathway
comparison (pathway
alignment)
This pathway diagram shows a comparison of pathways in (left) Homo
sapiens (human) and (right) Saccharomyces cerevisiae (baker’s yeast).
Changes in controlling enzymes (square boxes in red) and the pathway
itself have occurred (yeast has one altered (‘overtaking’) path in the
graph)
The citric-acid cycle
http://en.wikipedia.org/wiki/Krebs_cycle
The citric-acid cycle
M. A. Huynen, T. Dandekar and P. Bork
“Variation and evolution of the citric acid cycle: a
genomic approach” Trends Microbiol, 7, 281-29
(1999)
Fig. 1. (a) A graphical representation of the reactions of the
citric-acid cycle (CAC), including the connections with
pyruvate and phosphoenolpyruvate, and the glyoxylate shunt.
When there are two enzymes that are not homologous to
each other but that catalyse the same reaction (non-homologous gene displacement), one is marked with a solid
line and the other with a dashed line. The oxidative direction
is clockwise. The enzymes with their EC numbers are as
follows: 1, citrate synthase (4.1.3.7); 2, aconitase (4.2.1.3); 3,
isocitrate dehydrogenase (1.1.1.42); 4, 2-ketoglutarate
dehydrogenase (solid line; 1.2.4.2 and 2.3.1.61) and 2-ketoglutarate ferredoxin oxidoreductase (dashed line;
1.2.7.3); 5, succinyl-CoA synthetase (solid line; 6.2.1.5) or
succinyl-CoA–acetoacetate-CoA transferase (dashed line;
2.8.3.5); 6, succinate dehydrogenase or fumarate reductase
(1.3.99.1); 7, fumarase (4.2.1.2) class I (dashed line) and
class II (solid line); 8, bacterial-type malate dehydrogenase
(solid line) or archaeal-type malate dehydrogenase (dashed
line) (1.1.1.37); 9, isocitrate lyase (4.1.3.1); 10, malate
synthase (4.1.3.2); 11, phosphoenolpyruvate carboxykinase
(4.1.1.49) or phosphoenolpyruvate carboxylase (4.1.1.32);
12, malic enzyme (1.1.1.40 or 1.1.1.38); 13, pyruvate
carboxylase or oxaloacetate decarboxylase (6.4.1.1); 14,
pyruvate dehydrogenase (solid line; 1.2.4.1 and 2.3.1.12) and
pyruvate ferredoxin oxidoreductase (dashed line; 1.2.7.1).
The citric-acid cycle
b) Individual species might not
have a complete CAC. This
diagram shows the genes for the
CAC for each unicellular species
for which a genome sequence has
been published, together with the
phylogeny of the species. The
distance-based phylogeny was
constructed using the fraction of
genes shared between genomes
as a similarity criterion. The
major kingdoms of life are
indicated in red (Archaea), blue
(Bacteria) and yellow (Eukarya).
Question marks represent
reactions for which there is
biochemical evidence in the
species itself or in a related
species but for which no genes
could be found. Genes that lie in a
single operon are shown in the
same color. Genes were assumed
to be located in a single operon
when they were transcribed in the
same direction and the stretches
of non-coding DNA separating
them were less than 50
nucleotides in length.
M. A. Huynen, T. Dandekar and P. Bork “Variation and evolution of the citric acid cycle: a genomic approach” Trends Microbiol, 7, 281-29
(1999)
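The operon assumption in the caption (same transcription direction, less than 50 nucleotides of non-coding DNA between neighbouring genes) can be sketched as a simple grouping pass. The gene tuple layout `(name, start, end, strand)` and the function name `group_into_operons` are hypothetical choices for this illustration; only the rule itself comes from the caption.

```python
def group_into_operons(genes, max_gap=50):
    """Group genes into putative operons.

    `genes` is a list of (name, start, end, strand) tuples with 1-based,
    inclusive coordinates. Neighbouring genes are merged when they share a
    strand and fewer than `max_gap` nucleotides separate them.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    operons = []
    for gene in ordered:
        if operons:
            previous = operons[-1][-1]
            gap = gene[1] - previous[2] - 1  # non-coding bases in between
            if gene[3] == previous[3] and gap < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]


# Made-up coordinates, purely for illustration.
example = [
    ("geneA", 100, 400, "+"),
    ("geneB", 420, 900, "+"),
    ("geneC", 1000, 1500, "+"),
    ("geneD", 1510, 2000, "-"),
]
predicted = group_into_operons(example)
```

Here geneA and geneB are grouped (same strand, 19 nucleotides apart), geneC stands alone because the gap to geneB is too large, and geneD is separate because it lies on the opposite strand.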
Thinking about evolution
• Is the evolutionary model applicable to other
systems?
– Story telling in old cultures
– Richard Dawkins’ book The Selfish Gene talks
about memes
• The Genetic Algorithm (GA) is a widely used
computational optimisation strategy that is
modelled directly on Darwinian evolution
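A minimal genetic algorithm can be sketched in a few lines of Python, mirroring the four requirements for evolution listed earlier: a template (a list of bits), copying, variation (crossover and mutation) and selection. The parameter values, the truncation-selection scheme and the use of elitism are all assumptions made for this toy example; many other GA designs exist.

```python
import random


def genetic_algorithm(fitness, genome_length=20, population_size=30,
                      generations=40, mutation_rate=0.02, seed=0):
    """Evolve bit-string genomes to maximise `fitness`; return (best, history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Selection: only the fitter half of the population reproduces.
        parents = population[:population_size // 2]
        # Elitism: the current best genome is copied unchanged.
        children = [population[0][:]]
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Variation 1: single-point crossover between two parents.
            cut = rng.randint(1, genome_length - 1)
            child = mother[:cut] + father[cut:]
            # Variation 2: each bit flips with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy problem: maximise the number of 1s in the genome.
best, history = genetic_algorithm(sum)
```

Because the best genome is always carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.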