Download Bioinformatics Dr. Víctor Treviño Pabellón Tec

BIOINFORMATICS
DR. VÍCTOR TREVIÑO
[email protected]
Multiple Sequence Alignments and Phylogeny
SEQUENCE SIMILARITY

Within a protein sequence, some regions are more conserved than
others. The more conserved a region is, the more important it is
likely to be.
Reasons to perform sequence similarity analysis and searches:
- for function
- for 3D structure
- for localization
- for modification
- for interaction
- for regulation/control
- for transcriptional regulation (in DNA)
SEQUENCE ALIGNMENT

Procedure for comparing two (pair-wise alignment) or
more (multiple sequence alignment) sequences by
searching for similar patterns that appear in the same
order in the sequences
- Measures overall similarity
- Identical residues (nt or aa) are placed in the same column
- Non-identical residues are either placed in the same column (a mismatch) or aligned against a gap
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
Wikipedia, http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm
MULTIPLE SEQUENCE ANALYSIS –
ADDITIONAL USES
- Interesting regions
- Promoter regions
- Consensus sequence for probe design

Multiple Sequence Alignment - MSA
MULTIPLE SEQUENCE ALIGNMENT - MSA
Dynamic programming is designed for two
sequences
It would take quite a long time for three or more (see
MSA program)
[Figure: dynamic programming matrix with Sequence A and Sequence B on its axes]
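The two-sequence case can be illustrated with a minimal sketch of global (Needleman-Wunsch) dynamic programming. The scoring values (match = 1, mismatch = -1, gap = -2) and the function name are assumptions chosen for illustration, not values taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a)+1) x (len(b)+1) cells. A third sequence would turn it into a cube, and each further sequence adds another dimension, which is why the exact method becomes slow for three or more sequences.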
RELATION MSA & EVOLUTIONARY TREE
RECONSTRUCTION
MULTIPLE SEQUENCE ALIGNMENT –
METHODS

Extensions of sequence pair alignment
 - MSA
Progressive Methods
 - CLUSTALW
Iterative Methods
 - Hidden Markov Models (HMM)

MULTIPLE SEQUENCE ALIGNMENT - MSA
Algorithm
1. Calculate all pair-wise alignment scores (alignment costs).
2. Use the scores (costs) to predict a tree.
3. Calculate pair weights based on the tree.
4. Produce a heuristic msa based on the tree.
5. Calculate the maximum epsilon for each sequence pair.
6. Determine the spatial positions that must be calculated to obtain the optimal alignment.
7. Perform the optimal alignment.
8. Report the epsilon found compared to the maximum epsilon.
Epsilon for a given sequence pair is the difference between the score of the alignment of that pair in the msa and the score of
the optimal pair-wise alignment. The bigger the value of epsilon, the more divergent the msa from the pair-wise alignment and the smaller
the contribution of that alignment to the msa. For example, if an extra copy of one of the sequences is added to the alignment
project, then epsilon for sequence pairs that do not include that sequence will increase, indicating a lesser role because the contributions
of that pair have been out-voted by the alike sequences.
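The epsilon of one sequence pair can be sketched as follows. This is only an illustration under stated assumptions: it uses similarity scores (higher is better) with made-up values (match = 1, mismatch = -1, gap = -2), so epsilon is taken as the optimal pair-wise score minus the score of the same pair as it appears in the msa. The MSA program itself works with costs, and the function names here are hypothetical.

```python
def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of an already aligned pair (equal-length rows containing '-')."""
    total = 0
    for p, q in zip(x, y):
        if p == "-" and q == "-":
            continue                     # column is empty for this pair
        if p == "-" or q == "-":
            total += gap
        else:
            total += match if p == q else mismatch
    return total


def optimal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global pair-wise score, computed one matrix row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def epsilon(row_x, row_y):
    """How far the msa rows of a pair fall below their optimal alignment."""
    a, b = row_x.replace("-", ""), row_y.replace("-", "")
    return optimal_score(a, b) - pair_score(row_x, row_y)


msa = ["ACGT-A", "A-GTTA", "ACG-TA"]     # toy alignment, not from the slides
print(epsilon(msa[0], msa[1]), epsilon(msa[0], msa[2]))
```

An epsilon of 0 means the msa reproduces the best pair-wise alignment for that pair; larger values mean the msa has moved away from it.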
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT
Dynamic programming is designed for two sequences
It would take quite a long time for three or more (see MSA program)
Therefore:
- Align all sequences pair-wise
- Determine the "distance" between each pair
- Align the two most similar sequences and keep that alignment
- Take the next most similar sequence (or group) and repeat the same steps until all sequences have been included
E.g., for sequences S1, S2, S3, S4, S5:
1. (S3+S4)=c1
2. (S1+S2)=c2
3. (c1+c2)=c3
4. (c3+S5)=final
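The order in which a progressive method joins sequences can be sketched with a greedy, average-linkage grouping of a distance table. The distances below are invented so that the result reproduces the S1 to S5 example; the function name is hypothetical, and the sketch only decides the joining order, it does not align anything.

```python
from itertools import combinations


def progressive_order(names, dist):
    """Return the join steps chosen by repeatedly merging the closest groups."""
    clusters = {name: [name] for name in names}   # label -> member sequences
    steps = []
    while len(clusters) > 1:
        best = None
        for x, y in combinations(list(clusters), 2):
            pairs = [(p, q) for p in clusters[x] for q in clusters[y]]
            # Average distance between the members of the two groups.
            d = sum(dist[frozenset(pq)] for pq in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, x, y)
        _, x, y = best
        # Write the larger group first, as in "(c3+S5)".
        x, y = sorted((x, y), key=lambda label: -len(clusters[label]))
        label = "final" if len(clusters) == 2 else f"c{len(steps) + 1}"
        clusters[label] = clusters.pop(x) + clusters.pop(y)
        steps.append(f"({x}+{y})={label}")
    return steps


names = ["S1", "S2", "S3", "S4", "S5"]
dist = {frozenset(pair): 0.5 for pair in combinations(names[:4], 2)}
dist[frozenset(("S3", "S4"))] = 0.1               # most similar pair
dist[frozenset(("S1", "S2"))] = 0.2               # next most similar pair
for name in names[:4]:
    dist[frozenset((name, "S5"))] = 0.9           # S5 is far from everything
print(progressive_order(names, dist))
```

In a real progressive aligner each join step would be followed by aligning the two groups (sequence to sequence, sequence to alignment, or alignment to alignment).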
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT CLUSTALW
[Figure: CLUSTALW method, with labels "(then normalized to largest = 1)" and "Alignment Score for column"]
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT CLUSTALW
PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT PROBLEMS

- Dependency on the alignment of the most similar sequences
  - Errors become nested in the final alignment when the most similar sequences are actually quite different
  - So progressive methods such as CLUSTALW work best for closely related sequences
- Choice of suitable scoring matrices
ITERATIVE MULTIPLE SEQUENCE ALIGNMENT
Try to correct for the dependency on the most
similar sequences in progressive methods
- Repeatedly realign subgroups of sequences, then align
  these subgroups into a global alignment
- Subgroups are chosen based on tree ordering, separation of
  sequences, or random group selection
ITERATIVE MULTIPLE SEQUENCE ALIGNMENT
HMM MULTIPLE SEQUENCE ALIGNMENT
MULTIPLE SEQUENCE ALIGNMENT - PROGRAMS
MULTIPLE SEQUENCE ALIGNMENT - OVERVIEW
PHYLOGENY ANALYSIS AND
PREDICTION FROM DNA/PROTEIN
SEQUENCES
Determination of how the family might have
been derived during evolution
- Sequences are depicted as branches on a tree
- Very similar sequences are located as neighbours on a branch
- The goal is to discover all the branching relationships and the branch lengths

PHYLOGENY ANALYSIS AND
PREDICTION FROM DNA/PROTEIN
SEQUENCES
- Phylogenetic relationships among genes can help to predict which ones might have an equivalent function.
- Phylogenetic analysis may also be used to follow the changes occurring in a rapidly changing species, such as a virus.
- Important for discovering function, 3D structure, localization, modification, interaction, regulation/control, and transcriptional regulation.
PHYLOGENY ANALYSIS AND
PREDICTION FROM DNA/PROTEIN
SEQUENCES

Related to SEQUENCE ALIGNMENT
SEQUENCE SIMILARITY – EVOLUTIONARY
RELATIONSHIP
GENOME COMPLEXITY
GENOME COMPLEXITY
EVOLUTIONARY TREE



- An evolutionary tree is a two-dimensional graph showing evolutionary relationships among organisms
- The separate sequences are referred to as taxa (singular taxon), defined as phylogenetically distinct units on the tree
- The tree is composed of outer branches (or leaves) representing the taxa, and nodes and branches representing relationships among the taxa
EVOLUTIONARY TREE


- A and B are derived from a common ancestor
- Each node in the tree represents a splitting of the evolutionary path of the gene into two different species that are reproductively isolated
EVOLUTIONARY TREE


- Beyond the split, any further evolutionary changes in each new branch are independent of those in the other new branch
- The length of each branch to the next node represents the number of sequence changes that occurred prior to the next level of separation
EVOLUTIONARY TREE



- A uniform mutation rate implies the Molecular Clock Hypothesis, which is suitable for closely related species
- Special cases could use non-uniform rates
- The root is defined by including a taxon that we are reasonably sure branched off earlier than the others
EVOLUTIONARY TREE




- The sum of all the branch lengths in a tree is referred to as the tree length.
- The tree is also a bifurcating or binary tree, in that only two branches emanate from each node.
- Trees can have more than two branches emanating from a node if the events separating taxa are so close that they cannot be resolved, or to simplify the tree.
- The unrooted tree also shows the evolutionary relationships among sequences A–D, but it does not reveal the location of the oldest ancestor.
EVOLUTIONARY TREE

The number of possible rooted trees increases
very rapidly with the number of sequences or
taxa
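How fast the number of trees grows can be checked with the standard counting formulas for labelled bifurcating trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa, where !! is the double factorial. The formulas are standard results rather than something stated on the slide, and the function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (or 2)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labelled taxa."""
    return double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled taxa."""
    return double_factorial(2 * n - 5)


for n in (3, 4, 5, 10):
    print(n, rooted_trees(n), unrooted_trees(n))
```

With only 10 taxa there are already more than 34 million rooted trees, which is why methods that examine every tree are restricted to small numbers of sequences.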
METHODS TO BUILD EVOLUTIONARY TREES
To find the evolutionary tree or trees that best
account for the observed variation in a group of
sequences
- Maximum Parsimony
- Distance
- Maximum Likelihood

METHOD SELECTION
CONSIDERATIONS

- The alignment should not contain a large number of gaps
- Phylogenetic methods analyze conserved regions that are represented in all the sequences (local alignments)
MAXIMUM PARSIMONY
(OR MINIMUM EVOLUTION)



- Predicts the evolutionary tree by minimizing the number of steps required to generate the observed sequence changes
- Requires a multiple sequence alignment
- The method examines each informative position for each possible tree
  - An informative position has at least two different residues, each present in at least two sequences (so the same residue occurs in at least two sequences, but not in all)
- Used for sequences that are quite similar and for a small number of sequences
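Picking out the informative positions of an alignment can be sketched as below. The sketch uses the usual parsimony definition (at least two different residues, each present in at least two sequences) and, as an assumption of this example, ignores gap characters. The toy alignment and function names are made up for illustration.

```python
from collections import Counter


def is_informative(column):
    """True if at least two different residues each occur at least twice."""
    counts = Counter(column)
    counts.pop("-", None)                # assumption: gaps are not counted
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_positions(alignment):
    """Indices (0-based) of the informative columns of an alignment."""
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]


alignment = [
    "AAGAGT",
    "AGCCGT",
    "AGATAT",
    "AGAGAT",
]
print(informative_positions(alignment))
```

Only column 4 (G, G, A, A) separates the sequences into two groups of two; fully conserved columns and columns where a single sequence differs cannot favour one tree over another.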
MAXIMUM PARSIMONY
(OR MINIMUM EVOLUTION)
[Figure: example alignment with the non-informative positions marked]
DISTANCE METHODS




- Employs the number of changes between each pair of sequences
- Sequence pairs that have the smallest number of sequence changes are "neighbours" sharing a node in the tree
- Closely related to the multiple sequence alignment method (CLUSTALW), which produces distance matrices that can then be analysed by distance methods
- Remember the difference between distance and similarity (and how gaps are counted)
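A distance matrix can be derived from an alignment with a minimal sketch such as the following. It uses the simple p-distance (fraction of differing positions, that is, 1 minus the fraction of identical positions) and skips any column where either sequence has a gap. Both choices are assumptions made for illustration, and other gap treatments give different distances.

```python
def p_distance(x, y):
    """Fraction of differing positions, skipping columns that contain a gap."""
    compared = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not compared:
        raise ValueError("no gap-free columns to compare")
    return sum(p != q for p, q in compared) / len(compared)


def distance_matrix(alignment):
    """Symmetric matrix of pair-wise p-distances between aligned sequences."""
    n = len(alignment)
    return [[p_distance(alignment[i], alignment[j]) for j in range(n)]
            for i in range(n)]


for row in distance_matrix(["ACGT", "ACGA", "AC-T"]):
    print(row)
```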
DISTANCE METHODS
[Figure: "idealized" example]
DISTANCE ALGORITHMS
- Fitch and Margoliash Method
- Neighbor-joining Method
- Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
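Of the three, UPGMA is the simplest to sketch: repeatedly join the two closest clusters and set the distance from the new cluster to every other cluster to the size-weighted average of the old distances. The version below returns only the tree topology as nested tuples and leaves out branch lengths; the distances are invented, and the function name and data layout are assumptions of this example. UPGMA assumes a constant rate of change along all branches (a molecular clock).

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA topology. dist maps frozenset({a, b}) to a distance."""
    clusters = {name: 1 for name in names}        # cluster -> number of leaves
    d = dict(dist)                                # working copy, grows as we merge
    while len(clusters) > 1:
        # Closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # Arithmetic mean over all leaf pairs, via the size-weighted average.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


toy = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
print(upgma(["A", "B", "C", "D"], toy))
```

Here A and B are joined first, then C, and D last, so D ends up as the most distant taxon.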

DISTANCE ALGORITHM

Choosing an outgroup improves prediction because the
methods are informed about the "order" of the outgroup
(it branched off earlier than the other taxa)
MAXIMUM LIKELIHOOD
- Uses the probability of the number of sequence changes
- Analysis is performed for each informative residue (as in maximum parsimony)
- All possible trees are considered (so it suits a small number of sequences)
- Considers variations in mutation rates, so it can be used for more distant sequences
- Main disadvantage: computation time

MAXIMUM LIKELIHOOD

Needs a model that provides estimates of
substitution rates for each residue pair
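What such a model looks like can be sketched with the simplest one, Jukes-Cantor (JC69), in which every nucleotide is replaced by any other at the same rate. The slides do not name a model, so JC69 is an assumption chosen for illustration. The sketch computes the likelihood of two gap-free aligned sequences separated by d expected substitutions per site and compares a grid search with the known closed-form estimate.

```python
import math


def jc69_log_likelihood(x, y, d):
    """Log-likelihood of aligned x and y at distance d (substitutions/site, d > 0)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay         # probability a site is unchanged
    p_diff = 0.25 - 0.25 * decay         # probability of one specific change
    total = 0.0
    for a, b in zip(x, y):
        # 0.25 is the equilibrium frequency of the starting nucleotide.
        total += math.log(0.25 * (p_same if a == b else p_diff))
    return total


def jc69_distance(x, y):
    """Closed-form maximum likelihood distance (needs < 75% differing sites)."""
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


x, y = "ACGTACGTAC", "ACGTACGTTT"       # toy sequences, 2 differences in 10 sites
grid = [k / 1000 for k in range(1, 2001)]
best = max(grid, key=lambda d: jc69_log_likelihood(x, y, d))
print(best, jc69_distance(x, y))
```

A full maximum likelihood tree search applies the same kind of calculation to every column and every candidate tree, which is where the computation time goes.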
RELIABILITY OF PHYLOGENETIC
PREDICTIONS

- Bootstrap method: randomly resample the columns of the alignment and repeat the prediction (robustness test)
  - Good evidence if a branch is conserved in more than 70% of the predictions
- Collapse branches and confirm tree length
- Compare distinct methods and parameters
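The bootstrap can be sketched as resampling alignment columns with replacement and counting how often a grouping of interest reappears. To keep the example self-contained, the tree-building step is replaced by a hypothetical stand-in test (are the first two sequences the closest pair?); in practice each resampled alignment would be passed to a real tree method.

```python
import random


def bootstrap_alignment(alignment, rng):
    """New alignment of the same length, built from randomly drawn columns."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]   # with replacement
    return ["".join(seq[i] for i in picks) for seq in alignment]


def support(alignment, groups_together, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which groups_together(...) is True."""
    rng = random.Random(seed)
    hits = sum(bool(groups_together(bootstrap_alignment(alignment, rng)))
               for _ in range(replicates))
    return hits / replicates


def first_two_are_closest(aln):
    """Stand-in for tree building: is (0, 1) the pair with fewest mismatches?"""
    pairs = {(i, j): sum(p != q for p, q in zip(aln[i], aln[j]))
             for i in range(len(aln)) for j in range(i + 1, len(aln))}
    return min(pairs, key=pairs.get) == (0, 1)


aln = ["ACGTACGT", "ACGTACGA", "TCGAACTT", "TGGAACTT"]   # toy data
print(support(aln, first_two_are_closest))
```

A support above roughly 0.7 corresponds to the 70% rule of thumb in the list above.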

"CLASSIC" PROGRAMS

PHYLIP
http://evolution.genetics.washington.edu/phylip.html
PAUP
http://paup.csit.fsu.edu/downl.html

Phylemon
http://phylemon.bioinfo.cipf.es/cgi-bin/tools.cgi

PHYLEMON WEB SERVICE
PROGRAMS – WEB SERVICES
http://bioinformatics.ca/links_directory/index.php?search=phylogeny&submit=Search+Directory
BOOK
EXERCISE/HOMEWORK





- Select a gene
- Get the sequence in at least 7 species
- Select a site (Phylemon)
- Perform the multiple sequence alignment (ClustalW)
- Perform Phylogeny to obtain a tree
  - At least 2 tree methods
  - At least 3 parameter(s) changes
  - Take DNA/Protein
- Report results and discussion
  - 12 MSA+Trees
PAPERS TO REVIEW

Phylogeny-aware gap placement prevents
errors in sequence alignment and evolutionary
analysis – Löytynoja, Goldman, Science 2008
- Insertions and deletions are treated as different events
PAPERS PENDING FOR THIS SESSION