• With the astonishing advance of the Human Genome Project,
essentially all human genomic sequences are available in
public databases. The major task for the entire scientific
community is to identify medically important genes and
determine their functions. Discovery and characterization
of corin, the first transmembrane serine protease identified
from the heart, exemplifies such a challenge. The
bioinformatic and biochemical approaches used in our
studies can be applied to study many other genes.
• Serine proteases are important for a variety of biological
processes including food digestion, blood coagulation, host
defense and embryonic development. These proteases are
also protein targets of pharmaceutical drugs. For example,
inhibitors of blood clotting enzymes such as thrombin and
factor X are developed to prevent and treat thrombotic
diseases.
• To identify novel serine proteases in the cardiovascular
system, we used the BLAST program to search genomic
databases for new genes that share significant homology
with serine protease family members, such as trypsin. A
partial cDNA sequence (EST) was identified from a human
heart library and subsequently used to clone the full-length
cDNA of a novel gene, designated corin for its abundant
expression in the heart.
• Sequence analysis indicates that human corin cDNA
encodes a polypeptide of 1042 amino acids. Near the
amino terminus of corin, there is a transmembrane
domain identified by hydropathy plots using the GCG
program. In the extracellular region, corin contains
two frizzled-like cysteine-rich motifs, seven low
density lipoprotein receptor repeats, a macrophage
scavenger receptor-like domain, and a trypsin-like
protease domain. Such a unique mosaic domain
structure had not been found in any other trypsin
superfamily member.
• To study the function of corin, we performed a series
of biochemical experiments. Using combined
bioinformatic and biochemical approaches, we
have solved a long-standing puzzle in
cardiovascular biology.
4951 Projects
Characterize a protein. Carefully select a protein. Suggestions:
select a protein that is in a 3D database
select a protein that has been studied in many organisms
select a protein that has a known activity
select a protein for which there are known
mutant versions.
Use the library to research your chosen protein.
Demonstrate how its function is related to its structure.
Comment on the evolution of the protein.
Comment on the difference between the normal and mutant versions.
Find and analyze the DNA of the protein (determine its size, its intron #,
chromosomal location, etc.)
During your presentation, indicate the names of the software used and
databases analyzed to give you your information.
Introduction to Molecular
Phylogeny*
*Phylogeny- the evolutionary history
of a group
Requirement:
• Basic understanding of evolutionary
principles.
• Basic understanding of mutation at the
molecular level
Genetic variation exists.
Evolution depends on it.
• Genetic variation:
DNA segments (large or small) can be altered or
duplicated or deleted.
Point mutations or other small changes (e.g., A → G)
generate a new version of a gene (i.e. a new allele)
New loci are generated by gene duplication events.
Basis of Molecular
Phylogenetics
• To a first approximation, the evolution of species or
genes can be modeled as a bifurcating process. Two
populations become reproductively isolated and
diverge due to random mutational processes. Over
time, this process may repeat itself, so that at any
time, each population can be said to be most closely
related to some other population with which it
shares a direct common ancestor.
Basis of Molecular
Phylogenetics
• If genomes evolve by the gradual
accumulation of mutations, then the
amount of nucleotide sequence
difference between a pair of genomes
should indicate how recently those
two genomes shared a common
ancestor.
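A minimal sketch of this idea: the proportion of aligned positions that differ between two sequences (often called the p-distance). The function name, the gap-skipping rule, and the toy sequences are assumptions for illustration, not part of the lecture.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: columns with a gap are ignored
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


# Hypothetical toy sequences: one mismatch in eight positions.
print(p_distance("ACGTACGT", "ACGTACGA"))
```

A smaller value suggests a more recent common ancestor, provided mutations accumulate at a roughly steady rate.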
Basis of Molecular
Phylogenetics
• Divergence consists of changes in characters,
such as amino acids in a protein, or
nucleotides in DNA. The longer two
populations remain reproductively isolated,
the more divergence will occur. Given the
existence of homologous characters across a
set of populations, it should be possible to
work backwards in time, ascending the tree,
until a common ancestor of all populations in
the set is reached.
Word of Caution
• Phylogenetic analysis is one of the
most controversial areas in
bioinformatics. There are a wide
variety of different methods for
analyzing the data, and even the
experts often disagree on the best
method for analyzing the data.
Phylogenetic Data Analysis
requires 4 steps (text- starting on page 327)
• 1) Alignment
• 2) Determine the substitution model
• 3) Tree Building
• 4) Tree Evaluation
Alignment
• Phylogenetic analysis is very
dependent on a good multiple
alignment. The alignment of
sequences can often have more of
an impact on the final tree than the
choice of phylogenetic software or
phylogenetic parameters.
Homology
It is critical to phylogenetic analysis that
homologous characters be compared across
species. For DNA and proteins, this means that
gaps must be correctly placed in multiple alignments
to ensure that the same position is being
compared for each species. Consequently, if a
multiple alignment is poor, phylogeny
construction will also be poor.
What to align?
• Phylogenetic trees are
generated by comparing
DNA, RNA, or protein. The
molecule of choice depends
on the question you are
attempting to answer.
DNA/RNA
• contains more evolutionary
information than protein
• high rate of base substitution
makes DNA best for very short-term
studies, e.g., closely related species
Protein
• more reliable alignment than
DNA (for DNA, 25% identity is expected at random)
• fewer homoplasies* than DNA
• lower rate of substitution than
DNA; better for wide species
comparisons
*Homoplasy
• Return of a character to its original
state, thus masking intervening
mutational events. Homoplasies
are most important in DNA
sequences, because there are only 4
nucleotides. Every fourth mutation
should result in a homoplasy.
rRNA= ribosomal RNA
• Best for very long term evolutionary
studies spanning biological kingdoms
• Most consistent with an evolutionary
clock.
• Selective processes constraining
sequence evolution should be roughly
the same across species boundaries
Determine the substitution model
DNA:
• May be a nucleotide substitution rate matrix:
     A   C   G   T
A    -   2   1   2
C    2   -   2   1
G    1   2   -   2
T    2   1   2   -
Mutation Rates Vary:
• Transitions (purine to purine or
pyrimidine to pyrimidine) occur
more frequently than
transversions (purine to
pyrimidine or pyrimidine to
purine).
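As a rough sketch, the transition/transversion distinction can be turned into a weighted mismatch score. The default costs (1 for a transition, 2 for a transversion) follow the matrix on the previous slide; the function names, the zero cost for identical positions, and the rule of skipping non-ACGT characters are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Label a pair of bases as 'match', 'transition', or 'transversion'."""
    if a == b:
        return "match"
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"        # purine<->pyrimidine


def weighted_distance(seq_a, seq_b, transition_cost=1, transversion_cost=2):
    """Sum substitution costs over the aligned positions of two sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    cost = {"match": 0,
            "transition": transition_cost,
            "transversion": transversion_cost}
    total = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: gaps and ambiguity codes are skipped
        total += cost[classify(a, b)]
    return total


# Hypothetical toy pair: A/G (1) + A/C (2) + C/C (0) + C/T (1)
print(weighted_distance("AACC", "GCCT"))
```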
• In general, DNA distance matrices are
calculated such that each mismatch
between two sequences adds to the
distance, and each identity subtracts
from the distance. Scoring matrices
include values for all possible
substitutions.
Determine the substitution model
• May be an amino acid
substitution rate matrix such as
PAM or BLOSUM.
Tree Building
• There are four main tree drawing
methods.
• pairwise distance
• neighbor joining
• maximum parsimony
• maximum likelihood
Basic tree terminology:
Nodes: branching points
Branches: lines
Topology: branching pattern
Branches can be rotated at a node, without
changing the relationships.
Phylogenetic trees based on
pairwise distance.
Simplest to visualize with DNA data:
1) Align each pair of sequences under consideration
2) The two sequences that are closest together are
connected at a node. The branch lengths reflect
the degree of similarity (and theoretically reflect
evolutionary time).
3) The process is repeated until all sequences are
joined.
4) Addition of the last sequence defines the root of
the tree.
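The steps above can be sketched as a simple clustering loop. The slides do not name a specific algorithm, so average-linkage (UPGMA-style) clustering is assumed here; the function name, the distance dictionary, and the four toy taxa are hypothetical. Only the topology is returned, not branch lengths.

```python
from itertools import combinations


def cluster_tree(names, dist):
    """Join the closest clusters repeatedly; return the topology as nested tuples.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    # Each cluster is (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the two clusters that are closest together.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical distances between four sequences.
toy = {frozenset(p): d for p, d in [
    (("A", "B"), 2), (("A", "C"), 6), (("A", "D"), 10),
    (("B", "C"), 6), (("B", "D"), 10), (("C", "D"), 10)]}
print(cluster_tree(["A", "B", "C", "D"], toy))
```

In this toy case D joins last, so it sits on one side of the root, as in step 4.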
Phylogenetic trees based on
pairwise distance.
• Relatively simple.
• Problem:
–May not be accurate!!
Phylogenetic trees based on
neighbor joining.
• Also utilizes a ‘distance matrix’
• Neighbor joining algorithm searches
for sets of neighbors that minimize
the total length of the tree.
• Can produce reasonable trees,
especially when evolutionary
distances are short.
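A sketch of the first pairing step of neighbor joining, assuming the standard Saitou-Nei criterion Q(i,j) = (n-2)·d(i,j) - Σd(i,k) - Σd(j,k), where the pair with the smallest Q is joined. The function name and the five-taxon distance matrix are illustrative assumptions; a full implementation would also compute branch lengths, update the matrix, and repeat.

```python
def pick_neighbors(names, d):
    """Return (Q, name_i, name_j) for the pair with the smallest Q value.

    d is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(names)
    row_totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_totals[i] - row_totals[j]
            if best is None or q < best[0]:
                best = (q, names[i], names[j])
    return best


# Hypothetical distance matrix for five taxa.
taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(pick_neighbors(taxa, matrix))
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first: the criterion corrects for how far each taxon is from all the others.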
Pairwise distance and neighbor
joining are distance methods.
• There are two main categories of phylogeny
methods, distance methods and character
methods. In distance methods, the first step
is to calculate a matrix of all pairwise
differences between a set of sequences.
Next, the tree is constructed to minimize
the distance when all branches are added
together.
Maximum parsimony and
maximum likelihood are
character methods
• Character methods attempt to reconstruct
ancestral nodes of trees in order to fit the
tree to an evolutionary model. They
therefore use more of the information in
the data, at the expense of longer
execution time.
Phylogenetic trees based on
maximum parsimony
First step in maximum parsimony
analysis:
Identify all of the informative sites.
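A short sketch of this step, using the usual definition: a site is informative when at least two different characters each appear in at least two sequences. The function name, the gap handling, and the four toy sequences are assumptions for illustration.

```python
from collections import Counter


def informative_sites(alignment):
    """Return 0-based column indices of parsimony-informative sites."""
    sites = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        counts.pop("-", None)  # assumption: gaps are not counted as a state
        shared = [base for base, n in counts.items() if n >= 2]
        if len(shared) >= 2:
            sites.append(col)
    return sites


# Hypothetical aligned sequences.
toy_alignment = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]
print(informative_sites(toy_alignment))
```

Invariant columns and columns where only one sequence differs are dropped because they cost the same on every tree.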
Parsimony Analysis 2nd step: Calculate the
minimum number of substitutions at each
informative site
[Figure: three candidate trees for one informative site, requiring 1 step, 2 steps, and 2 steps]
Final step in
parsimony analysis:
After sequences are aligned,
algorithms model each tree:
Sum the number of changes
over all informative sites for
each possible tree.
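A sketch of the scoring, assuming Fitch's counting rule for the minimum number of changes on a given tree. Trees are written as nested tuples of sequence names; the names, the toy alignment, and the choice to score every column (uninformative columns add the same cost to each tree) are assumptions.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):              # a leaf: its observed character
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                             # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for col in range(length):
        states = {name: seq[col] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical alignment and the three possible groupings of four sequences.
toy = {"s1": "AAGAGTGCA", "s2": "AGCCGTGCG",
       "s3": "AGATATCCA", "s4": "AGAGATCCG"}
candidate_trees = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]
scores = [parsimony_score(t, toy) for t in candidate_trees]
print(scores)
```

With these toy data the first grouping needs the fewest changes.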
Parsimony:
General scientific criterion for choosing
among competing hypotheses states that
we should accept the hypothesis that
explains the data most simply and
efficiently.
• The tree requiring the _______ number
of nucleic acid or amino acid substitutions
is selected.
Problem- As the # of sequences increases, the #
of possible trees increases dramatically
# of sequences    # of trees
3                 1
4                 3
5                 15
6                 105
7                 945
8                 10,395
9                 135,135
10                2,027,025
50                2.8 x 10^74
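These counts match the number of unrooted bifurcating trees for n sequences, which is the product of the odd numbers 1 · 3 · 5 · ... · (2n - 5). A small sketch (the function name is an assumption):

```python
def unrooted_tree_count(n: int) -> int:
    """Number of unrooted bifurcating trees for n sequences (n >= 3)."""
    if n < 3:
        raise ValueError("need at least three sequences")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (3, 4, 5, 6, 7, 8, 9, 10):
    print(n, unrooted_tree_count(n))
print(50, f"{unrooted_tree_count(50):.1e}")
```

For 10 sequences this gives 2,027,025 trees.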
Programs take shortcuts.
• When a large number of trees are being
compared, it is impossible to score
each tree. A shortcut algorithm
establishes an upper limit. As it
evaluates other trees, it throws out any
tree exceeding the upper bound before
the calculation is completed.
Phylogenetic trees based
on maximum likelihood
Also evaluates every possible
tree topology. ML methods
are probabilistic. They
assign probabilities to every
possible evolutionary change
at informative sites.
Phylogenetic trees based
on maximum likelihood
The aim is to find the tree
(among all possible trees)
with the highest L
(likelihood) value.
Tree Evaluation
Bootstrap method of assessing tree
reliability:
Inferred tree is constructed from data set.
Characters are resampled from the data set
with replacement.
Resampling is repeated several (100-1000)
times.
Bootstrap method
Bootstrap trees are constructed from the
resampled data sets.
Each bootstrap tree is compared to the original
inferred tree.
The % of bootstrap trees supporting a node
is determined for each node in the
tree.
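The resampling can be sketched as follows. Whole alignment columns are drawn with replacement, and a caller-supplied check is applied to each resampled data set. The function names, the toy alignment, and the toy check (is s1/s2 the closest pair?) are assumptions; in practice the check would build a tree from each replicate and ask whether a given node is present.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping each column intact."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def s1_s2_are_closest(alignment):
    """Toy stand-in for tree building: is s1/s2 strictly the closest pair?"""
    names = sorted(alignment)
    target = mismatches(alignment["s1"], alignment["s2"])
    others = [mismatches(alignment[a], alignment[b])
              for i, a in enumerate(names) for b in names[i + 1:]
              if {a, b} != {"s1", "s2"}]
    return all(target < d for d in others)


def bootstrap_support(alignment, supports_group, replicates=100, seed=0):
    """Percentage of resampled data sets for which supports_group is true."""
    rng = random.Random(seed)
    hits = sum(supports_group(bootstrap_alignment(alignment, rng))
               for _ in range(replicates))
    return 100.0 * hits / replicates


# Hypothetical alignment.
toy = {"s1": "AAGAGTGCA", "s2": "AGCCGTGCG",
       "s3": "AGATATCCA", "s4": "AGAGATCCG"}
print(bootstrap_support(toy, s1_s2_are_closest, replicates=100, seed=1))
```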
Why the controversy??
• Molecular vs. Classical
• Different Methods → Same Tree??
• Molecular Clock
The End