TREE CONSTRUCTION
1. Definition of a phylogenetic tree
2. Features of a phylogenetic tree
• Branches
• Nodes (External & Internal)
3. Unrooted trees
4. Rooted trees
• Choice of an outgroup
5. Inferred and true trees
6. Gene trees are not the same as species trees
7. Tree reconstruction
7.1 Molecular sequences
7.2 Sequence alignment is the essential preliminary to tree reconstruction
7.3 Converting the alignment data into a phylogenetic tree
7.4 Assessing accuracy of a reconstructed tree
7.5 Molecular clocks enable the time of divergence of ancestral sequences to be estimated
8. The applications (examples) of molecular phylogenetics
8.1 Clarifying evolutionary relationships between humans & other primates
8.2 The oldest life on earth (rRNA and phylogeny)
8.3 The origin of AIDS
8.4 Problems with prion
9. Molecular phylogeny as a tool of the study of human prehistory
9.1 Intraspecific studies require highly variable genetic loci
9.2 The origins of modern humans - out of Africa or not?
9.3 The patterns of more recent migration into Europe are also controversial
9.4 Prehistoric human migration into the new world
10. Exercises for drawing a phylogenetic tree
11. Software packages for reconstruction of phylogenetic trees
1. Definition of a phylogenetic tree
A tree is an acyclic connected graph that consists of a collection of nodes (internal and external)
and branches connecting them so that every node can be reached by a unique path from every
other node.
Figure: An unrooted phylogenetic tree joining 4 taxonomic units (external nodes A, B, C and D),
with the branches, external nodes and internal nodes labelled.
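The definition above can be checked mechanically. The sketch below is only an illustration: the node names "x" and "y" for the two internal nodes, and the grouping of A with B and C with D, are assumptions made for the example and are not taken from the figure. It tests the two defining properties, connectedness and the absence of cycles.

```python
from collections import deque

# Hypothetical encoding of a 4-taxon unrooted tree. A-D are external nodes;
# "x" and "y" are illustrative names for the two internal nodes.
BRANCHES = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]


def build_adjacency(branches):
    """Map each node to the set of nodes it is joined to by a branch."""
    adjacency = {}
    for u, v in branches:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def is_tree(branches):
    """Return True if the branches form a connected graph with no cycles."""
    adjacency = build_adjacency(branches)
    if not adjacency:
        return False
    # A connected graph on n nodes is acyclic exactly when it has n - 1 branches.
    if len(branches) != len(adjacency) - 1:
        return False
    # Breadth-first search from an arbitrary node to test connectedness.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


def external_nodes(branches):
    """External nodes are those attached to exactly one branch."""
    adjacency = build_adjacency(branches)
    return sorted(node for node, nbrs in adjacency.items() if len(nbrs) == 1)
```

Adding a branch between two existing nodes, for example directly between A and B, creates a cycle, so `is_tree` would then return False.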
2. Features of a phylogenetic tree
In the area of phylogenetic inference, trees are used as visual displays that represent
hypothetical, reconstructed evolutionary events. The tree in this case consists of:
• Internal nodes, which represent taxonomic units such as species or genes; the external nodes,
those at the ends of the branches, represent living organisms.
• The lengths of the branches usually represent elapsed time, measured in years, or the
number of molecular changes (e.g. mutations) that have taken place between the two nodes.
The latter is calculated from the degree of difference when sequences are compared (refer to
"alignments" later).
• Sometimes the lengths are irrelevant and the tree represents only the order of evolution. [In
a dendrogram, only the lengths of horizontal (or vertical, as the case may be) branches
count.]
• Finally, the tree may be rooted or unrooted.
3. Unrooted trees
An unrooted tree simply represents phylogenetic relationships but does not provide an
evolutionary path. In an unrooted tree, an external node represents a contemporary organism.
Internal nodes represent common ancestors of some of the external nodes. In this case, the tree
shows the relationship between organisms A, B, C & D and does not tell us anything about the
series of evolutionary events that led to these genes (see figure above). There is also no way to
tell whether or not a given internal node is a common ancestor of any 2 external nodes.
4. Rooted trees
In a rooted tree, one node is designated as the root and, in essence, represents the common
ancestor of all the external nodes. An outgroup enables the root of the tree to be located and
the correct evolutionary pathway to be identified. In the above case, the root could lie on any
one of the five branches of the unrooted tree, so five different evolutionary pathways are
possible, each depicted by a different rooted tree.
Figure: The five rooted trees that can be drawn from the unrooted tree (box). The positions
of the roots are indicated by the numbers on the outline of the unrooted tree (box).
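The count of five rooted trees can be reproduced by placing the root on each branch of the unrooted tree in turn. The sketch below makes several assumptions for illustration: the internal node names "x" and "y", the grouping of A with B and C with D, and the use of bracket (Newick-style) notation to write each rooted tree. None of these come from the figure.

```python
# Hypothetical encoding of the 4-taxon unrooted tree; "x" and "y" are
# illustrative names for the two internal nodes.
BRANCHES = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"), ("D", "y")]


def adjacency_of(branches):
    """Map each node to the set of nodes it is joined to by a branch."""
    adjacency = {}
    for u, v in branches:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def subtree(adjacency, node, parent):
    """Bracket notation for the part of the tree reached from `node`
    when arriving from `parent`."""
    children = sorted(
        subtree(adjacency, nbr, node) for nbr in adjacency[node] if nbr != parent
    )
    if not children:
        return node  # an external node is written as its own name
    return "(" + ",".join(children) + ")"


def root_on_branch(branches, branch):
    """Place the root in the middle of `branch` and return the rooted tree."""
    adjacency = adjacency_of(branches)
    u, v = branch
    sides = sorted([subtree(adjacency, u, v), subtree(adjacency, v, u)])
    return "(" + ",".join(sides) + ");"


def all_rootings(branches):
    """One rooted tree per branch of the unrooted tree."""
    return [root_on_branch(branches, branch) for branch in branches]


if __name__ == "__main__":
    for tree in all_rootings(BRANCHES):
        print(tree)
```

Rooting on the central branch gives the symmetric tree `((A,B),(C,D));`, while rooting on a branch leading to an external node makes that node the first to diverge.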
Choice of an outgroup
The criteria used to choose an outgroup depend very much on the type of analysis that is
carried out. Suppose that 4 homologous (orthologous) genes in a tree come from human,
chimpanzee, gorilla and orangutan. A useful primate outgroup is the homologous sequence
from baboon, as palaeontological evidence suggests that baboons branched away from the
lineage leading to human, chimpanzee, gorilla and orangutan before the time of the common
ancestor of the four species (figure below).
Figure: The use of an outgroup to root a phylogenetic tree (taxa shown: human, chimpanzee,
gorilla and orangutan, with baboon as the outgroup).
5. Inferred and true trees
We refer to the rooted tree given above as an inferred tree. This is to emphasise that it depicts the
series of evolutionary events that are inferred from the data that were analysed, and may not
necessarily be the same as the true tree, the one that depicts the actual series of events that
occurred. Sometimes we can be fairly confident that the inferred tree is the true tree, but most
phylogenetic data analyses are prone to uncertainties. Degrees of confidence can be assigned to the
branching patterns in an inferred tree using bootstrap analysis (discussed in a later section). Due to
the imprecise nature of phylogenetic analysis, controversies have arisen.
6. Gene trees are not the same as species trees
The above tree is a gene tree, i.e. a tree derived by comparing orthologous sequences (those
derived from the same ancestral sequence). The assumption is that this gene tree is a more accurate
reflection of a species tree than the one that can be inferred from morphological data. This
assumption is generally correct but it does not mean that the gene tree is the same as a species tree.
Mutation and speciation are not expected to occur at the same time. For example, the mutation
event could precede the speciation event. This would mean that, to begin with, both alleles will
still be present in the same unsplit population of the ancestral species. When the population split
occurs, it is likely that both alleles will be present in each of the resulting groups. After the split,
the new populations evolve independently. One possibility is that, as a result of random genetic
drift, one allele is lost from one population and the other allele is lost from the second population.
This establishes the two separate genetic lineages that are inferred from phylogenetic
analysis of the gene. How do these considerations affect the coincidence between a gene tree and a
species tree?
(a) If a molecular clock is used to date the time at which gene divergence took place, then it cannot
be assumed that this is also the time of the speciation event. A significant difference between the
time of the gene divergence and the time of the speciation event can exist even though the species
tree & gene tree look the same (see figure (a) below).
(b) If the first speciation event is followed closely by a second speciation event in one of the two
populations, then the branching order of the gene tree might be different from that of the species
tree. This can occur if the genes in the modern species are derived from alleles that had already
appeared before the first of the two speciation events (see figure (b) below).


Figure (a): Gene tree & species tree look the same. However, mutation might precede speciation,
giving an incorrect time for the latter if a molecular clock is used.
Figure (b): A gene tree can have a different branching order from a species tree.
7. Tree reconstruction
In any molecular phylogenetic reconstruction the following points need to be addressed.
7.1 Molecular sequences
7.2 Sequence alignment is the essential preliminary to tree reconstruction
7.3 Converting the alignment data into a phylogenetic tree
7.4 Assessing accuracy of a reconstructed tree
7.5 Molecular clocks enable the time of divergence of ancestral sequences to be estimated
7.1 Molecular sequences
Nucleic acid (rRNA, DNA) and protein sequences are used in molecular phylogenetic tree
construction. DNA yields more phylogenetic information than protein and has become by far the
predominant molecule for phylogeny:
• More statistical information from DNA data: The nucleotide sequences of a pair of
homologous genes have a higher information content than the amino acid sequences of the
corresponding proteins, because mutations that result in synonymous changes affect only
the DNA sequence. Moreover, coding as well as non-coding regions of the genome can be
examined. Write out the DNA sequences for the following two amino acid sequences as an
example of this. You can see that at the protein level there is only 1 difference but at the
nucleic acid level there are 3 differences.
Protein1  -gly-ala-ile-leu-asp-arg-
DNA1      -gga-gcc-ata-tta-gat-aga-
DNA2      -gga-gca-att-ttt-gat-aga-
Protein2  -gly-ala-ile-phe-asp-arg-
• Ease of sequencing DNA: Samples for DNA sequencing can be prepared by PCR,
which is an extremely easy technique.
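The comparison in the example above can be reproduced with a few lines of code. This is a minimal sketch: the codon table is deliberately partial and lists only the codons that occur in the two example sequences, and the function names are illustrative.

```python
# Partial codon table: only the codons that appear in the example sequences.
CODON_TO_AMINO_ACID = {
    "gga": "gly", "gcc": "ala", "gca": "ala",
    "ata": "ile", "att": "ile", "tta": "leu",
    "ttt": "phe", "gat": "asp", "aga": "arg",
}

DNA1 = "-gga-gcc-ata-tta-gat-aga-"
DNA2 = "-gga-gca-att-ttt-gat-aga-"


def codons(dna):
    """Split a hyphen-separated sequence into its codons."""
    return dna.strip("-").split("-")


def translate(dna):
    """Translate each codon with the partial table above."""
    return [CODON_TO_AMINO_ACID[codon] for codon in codons(dna)]


def nucleotide_differences(dna_a, dna_b):
    """Count positions at which the two nucleotide sequences differ."""
    a, b = "".join(codons(dna_a)), "".join(codons(dna_b))
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def amino_acid_differences(dna_a, dna_b):
    """Count positions at which the translated proteins differ."""
    return sum(x != y for x, y in zip(translate(dna_a), translate(dna_b)))


if __name__ == "__main__":
    print(nucleotide_differences(DNA1, DNA2))  # differences at the DNA level
    print(amino_acid_differences(DNA1, DNA2))  # differences at the protein level
```

Two of the three nucleotide changes (gcc to gca, ata to att) are synonymous and so are invisible at the protein level; only tta to ttt changes the amino acid, from leu to phe.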
Protein electrograms, restriction fragment length polymorphism (RFLP), simple sequence
length polymorphism (SSLP), single nucleotide polymorphism (SNP) and DNA-DNA
hybridization data have also been used for molecular phylogenetic reconstruction.
Immunological data from cross-reactivity studies were also used for such work in 1904.
7.2 Sequence alignment is the essential preliminary to tree reconstruction
This is the most important step in molecular phylogeny and a number of issues have to be
considered:
• Sequence homologs: Sequences that are to be aligned should be homologs. An example of
this is the globin genes of different vertebrates. This is to satisfy the phylogeny criterion,
which states that the sequences should be derived from a common ancestral sequence.
• Non-homologous sequences: If the sequences are not homologous, and hence do not share a
common ancestor, phylogenetic construction methods will still produce a tree, but the tree
will not be of any biological relevance. This type of error commonly occurs when
undertaking homology analysis to assign functions to newly generated gene sequences.
BLAST is used extensively as one of the homology analysis methods, and hence interpretation
of the data arising from the analysis should be undertaken with care.
• Easy alignments: Correctly aligning the homologous sequences is the next task. In some
cases it is an easy task. A simple sequence alignment is shown below:
Sequence 1  AGCAATGGCCAGACAATAATG
Sequence 2  AGCTATGGACAGACATTAATG
            *** **** ****** *****
• Difficult alignments: If sequences have evolved and diverged by accumulating insertions
and deletions as well as point mutations, then these sequences are not always easy to align.
Insertions and deletions cannot be distinguished when pairs of sequences are aligned, so we
refer to them as indels. Below is a pair of sequences that is difficult to align because
placing the indel at the correct location can become a problem. Two possible positions for
the indel are shown:
Sequence 1  GACGACCATAGACCAGCATAG
Sequence 2  GACTACCATAGA-CTGCAAAG
            *** ******** * *** **
Sequence 1  GACGACCATAGACCAGCATAG
Sequence 2  GACTACCATAGACT-GCAAAG
            *** *********  *** **
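The difficulty with the two indel placements can be made concrete by counting matched positions. The short sketch below uses the sequences from the difficult alignment above; the function names are illustrative. Counting identical columns is only one possible way of scoring an alignment, and it is used here because it mirrors the asterisk lines.

```python
SEQUENCE_1 = "GACGACCATAGACCAGCATAG"
PLACEMENT_A = "GACTACCATAGA-CTGCAAAG"  # indel after position 12 of Sequence 2
PLACEMENT_B = "GACTACCATAGACT-GCAAAG"  # indel after position 14 of Sequence 2


def match_line(a, b):
    """Return a line with '*' under identical nucleotides and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("*" if x == y and x != "-" else " " for x, y in zip(a, b))


def count_matches(a, b):
    """Number of aligned positions carrying the same nucleotide."""
    return match_line(a, b).count("*")


if __name__ == "__main__":
    for placement in (PLACEMENT_A, PLACEMENT_B):
        print(SEQUENCE_1)
        print(placement)
        print(match_line(SEQUENCE_1, placement), count_matches(SEQUENCE_1, placement))
```

Both placements yield the same number of matched positions, which is why the sequence data alone cannot say where the indel belongs.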
• The dot matrix technique for alignment: Some alignments can easily be done by "eyeballing"
the sequences, yet others may require a pen and paper. The simplest pen-and-paper approach
is known as the dot matrix method. The two sequences are written out on the x- and y-axes of
a sheet of graph paper, and dots are placed in the squares at the positions corresponding to
identical nucleotides in the two sequences. The alignment is indicated by a diagonal series of
dots, broken by empty squares where the sequences have nucleotide differences, and shifting
from one column to another where indels occur.
Figure: The dot matrix technique for sequence alignments (sequence on the x-axis:
AGACATTTAGACCAA; sequence on the y-axis: AGACATTAGACCAA).
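The dot matrix method is easy to sketch in code. The example below uses the two sequences from the axes of the figure; the function names and the `min_length` parameter are illustrative choices, not part of the method as described. It builds the grid of dots and then lists the long unbroken diagonal runs, which are the stretches a reader would pick out by eye.

```python
X_SEQUENCE = "AGACATTTAGACCAA"  # written along the x-axis
Y_SEQUENCE = "AGACATTAGACCAA"   # written along the y-axis


def dot_matrix(x_seq, y_seq):
    """Grid with one row per y nucleotide; True marks identical nucleotides."""
    return [[x == y for x in x_seq] for y in y_seq]


def render(x_seq, y_seq):
    """Plain-text picture of the matrix, with '.' for each dot."""
    lines = ["  " + " ".join(x_seq)]
    for y, row in zip(y_seq, dot_matrix(x_seq, y_seq)):
        lines.append(y + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(lines)


def diagonal_runs(x_seq, y_seq, min_length=3):
    """Return (row, column, length) for each unbroken diagonal run of dots
    that is at least `min_length` long."""
    matrix = dot_matrix(x_seq, y_seq)
    runs = []
    for i in range(len(y_seq)):
        for j in range(len(x_seq)):
            if not matrix[i][j]:
                continue
            if i > 0 and j > 0 and matrix[i - 1][j - 1]:
                continue  # this dot continues a run that started earlier
            length = 0
            while (i + length < len(y_seq) and j + length < len(x_seq)
                   and matrix[i + length][j + length]):
                length += 1
            if length >= min_length:
                runs.append((i, j, length))
    return runs


if __name__ == "__main__":
    print(render(X_SEQUENCE, Y_SEQUENCE))
    print(diagonal_runs(X_SEQUENCE, Y_SEQUENCE, min_length=7))
```

For these sequences the two longest runs lie on neighbouring diagonals: one starts at the top-left corner and the other is shifted one column to the right, which is the shift that signals an indel (the extra T in the x-axis sequence).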

• The similarity approach is a mathematically based alignment technique: The similarity
approach (Needleman and Wunsch, 1970) aims to maximise the number of identical matched
nucleotides in the two sequences. The distance method (Waterman, 1976), on the other
hand, minimises the number of mismatches. Often the two approaches will identify the same
alignment as being the best one.
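A similarity-based alignment of the kind described above is usually computed by dynamic programming. The sketch below is a simplified global alignment in the spirit of the similarity approach, not a reproduction of the published algorithm: the scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions chosen for illustration.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best global alignment.

    The scoring values are illustrative assumptions, not taken from the text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                aligned_a.append(a[i - 1])
                aligned_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    print(global_alignment("ACGT", "AGT"))
```

When several alignments tie for the best score, as with the two indel placements discussed earlier, the traceback returns only one of them; which one depends on the order in which the three moves are tried.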


• Multiple alignments are generated for more than two sequences: Rarely can one do multiple
alignments with a pen and paper, so all the steps required for phylogenetic analysis are
undertaken on a computer. Several computer programs are available for automatically
generating multiple alignments (discussed later).
rRNA genes (aka rDNA) and rRNA have been used as molecular chronometers in
phylogenetic studies. Refer to the section on rRNA for detailed notes on the
methods of aligning these types of nucleic acids.
7.3 Converting the alignment data into a phylogenetic tree