Sequence Alignment - Mainlab Bioinformatics

Bioinformatics for Research
Module 1
Sequence Alignment
September 1, 2015
Mainlab Bioinformatics, Washington State University
Learning Outcomes
• Understand what sequence alignment is and how it is
useful
• Understand the different types of sequence alignment
and when you might use them
• Understand the importance of homology
• Understand scoring matrices and when to use them
• Understand that there are different alignment algorithms
• Gain a basic understanding of BLAST and its different flavors
What is Sequence Analysis?
• A sequence is ___________________________________
• A biological sequence is __________________________
______________________________________________
• Sequence analysis in bioinformatics refers to__________
______________________________________________
______________________________________________
______________________________________________
______________________________________________
______________________________________________
Concept of Sequence Alignment
• An alignment is a mutual arrangement of two sequences
• Shows where two sequences are similar, and where they differ
• An ‘optimal’ alignment has the most correspondences and the fewest
differences
• Sequences that are similar probably have the same function
(through descent from a common ancestor)
• Sequence alignment involves the identification of the correct
location of deletions and insertions that have occurred in
either of the two lineages since divergence from a common
ancestor.
Sequence Alignment
• Sequence alignment is the procedure of comparing (at the
residue level) two (pairwise alignment) or more sequences
(multiple sequence alignment) by searching for a series of
common features (characters or patterns) that occur in the
same order in the sequences.
• Sequence alignment is useful for discovering functional,
structural and evolutionary information in biological
sequences.
Importance of Homology!
• Homology strongly suggests that molecules have similar structure
and function
• Significantly similar molecular sequences are very unlikely to occur
by chance.
• Significant similarity between sequences implies that the
sequences/structures are homologous, i.e. at some point in the
past they shared a common ancestor and therefore share structure and
function.
• Differences between families of species result from mutations
during the course of evolution. Most of these changes are due to
local mutations within nucleotide sequences.
Orthologs
Sharing a common ancestor
• Orthologs: occur in separate species and derive from a common ancestor
[Figure: over time, ancestral gene A gives rise to A1 in Descendent 1 and A2 in Descendent 2]
Paralogs
Sharing a Common Ancestor
• Paralogs: arise by gene duplication independent of speciation (genome duplication)
[Figure: over time, ancestral gene A is duplicated into A1 and A2 within a single descendent]
Homology
• Homology designates a qualitative relationship of common
descent between entities
• Two genes are either homologs or they are not!
• It doesn’t make sense to say “two genes are 43%
homologous”, just as it doesn’t make sense to say
“Jane is 43% diabetic”
Sequence Alignment
• Changes in sequence may be found in alignments due to
divergence from the common ancestor.
• These changes are categorized as substitutions, insertions
and deletions.
• A primary use of sequence alignment is to determine if two
sequences are sufficiently similar to declare them
homologous and therefore likely to share similar structure
and function
• What else can we use sequence alignment for?
______________________________________
______________________________________
______________________________________
Sequence Alignment
• What does sequence identity mean? ___________________
_________________________________________________
• What does sequence similarity mean? __________________
_________________________________________________
• What is sequence homology? _________________________
_________________________________________________
_________________________________________________
_________________________________________________
_________________________________________________
Sequence Alignment Methods
Method: Dot Matrix Plot
Use: General exploration of your sequence:
- Discovering repeats
- Finding rearrangements
- Predicting regions of self-complementary RNA
- Extracting portions of sequence to make a multiple
alignment
Method: Global Alignments
Use: Comparing two sequences over their entire length:
- Identifying long insertions/deletions
- Checking the quality of your data
- Identifying every mutation in your sequence
Method: Local Alignments
Use: Comparing sequences with partial homology:
- Making high quality alignments
- Making residue-per-residue analysis
Dot Matrix Plots
• Also known as dot plots, they represent the simplest method of
evaluating similarity between two sequences
• Identifies all possible matches of residues between the two
sequences.
• One sequence (A) is listed horizontally (top of page) and the
other sequence (B) vertically (left side of page).
• Starting with the first character in B, the comparison moves
across the row and places a dot in the plot space where both of
the sequence elements are the same.
• Adjacent regions of identity between the two sequences produce
diagonal lines of dots in the plot.
Dot Matrix Plots
• The diagonal line always appears when a sequence is
compared to itself.
• In a dot matrix, detection of matching regions may be improved by filtering out
random matches, for example by increasing the window size.
• This is done using a sliding window to compare the two sequences
• Sliding windows
• Window size: Number of characters to compare
• Stringency: Number of characters within the window that have to match exactly
Dot Matrix Plots
• Window size
• A larger window size is used for DNA sequences than for proteins
because the number of random matches is much larger due to the
use of only 4 DNA characters compared to 20 amino acid characters
• For DNA sequence comparisons, use long windows and high
stringencies, e.g. a window size of 15 and a stringency of 10.
• For protein comparisons, use short windows and low stringencies
except when looking for a short domain in a partially similar
sequence.
[Figure: a sliding window of 15 characters on the sequence GAACTCATACGAATTCACATTAGAC]
Dot Matrix Plots
• Try this with the example THEFATCAT as sequence 1 and
THEFASTCAT as sequence 2. What does the result tell you?
  T H E F A T C A T
T
H
E
F
A
S
T
C
A
T
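The exercise above can also be explored in code. The following is a minimal sketch of a dot plot with a sliding window, not the method used by any particular program. The function name `dot_plot` and the `*`/`.` output symbols are assumptions made for illustration. A cell is marked when at least `stringency` of the `window` characters starting at that position are identical.

```python
def dot_plot(seq_a, seq_b, window=1, stringency=1):
    """Return the rows of a dot plot as strings.

    seq_a runs across the top (columns), seq_b down the side (rows).
    A '*' marks a window start where at least `stringency` of the
    `window` compared characters are identical; '.' marks no match.
    """
    rows = []
    for i in range(len(seq_b) - window + 1):
        row = []
        for j in range(len(seq_a) - window + 1):
            matches = sum(seq_a[j + k] == seq_b[i + k] for k in range(window))
            row.append("*" if matches >= stringency else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    top, side = "THEFATCAT", "THEFASTCAT"
    print("  " + top)
    for residue, row in zip(side, dot_plot(top, side)):
        print(residue + " " + row)
```

With the default window of 1 every single-letter match is plotted, including random ones. Raising `window` and `stringency` (for example both to 3) keeps only the diagonal runs, and the extra S in sequence 2 shows up as a shift between two diagonals.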
Sequence Analysis
Dot Plot Programs
• Dottup - Displays a wordmatch dotplot of two sequences
• Dotmatcher - Draw a threshold dotplot of two sequences
• Dotpath - Draw a non-overlapping wordmatch dotplot of
two sequences
• Polyplot - Draw dotplots for all-against-all comparison of a
sequence set
http://emboss.bioinformatics.nl/
Types of Sequence Alignment
• Global: Alignments that stretch over the entire sequence length to include
as many matching residues as possible
• Example (Needleman-Wunsch Algorithm):
GGSDNWSA-T IPG
GN-RAWA A MNPA
• Used to align two closely related sequences of similar length
• Useful for checking minor differences between two sequences,
analyzing polymorphism between closely related species, and comparing
two sequences that partially overlap
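To make the idea of a global alignment concrete, here is a minimal sketch of Needleman-Wunsch dynamic programming. It assumes a simple scoring scheme (match +1, mismatch -1, and a linear gap penalty of -2 per gap position). These values and the function name `needleman_wunsch` are illustrative assumptions, not settings taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("THEFATCAT", "THEFASTCAT")
    print(total)
    print(top)
    print(bottom)
```

Because every residue of both sequences must appear in the result, the method suits sequences of similar length. Real tools typically use a substitution matrix and separate gap-open and gap-extend penalties instead of the flat values assumed here.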
Types of Sequence Alignment
• Local: Higher priority given to aligning local regions of high similarity
rather than extending the alignment to neighboring residues with lower
scores.
-----DTGA----
-----DTGA----
Smith-Waterman Algorithm
• Dynamic programming - Smith-Waterman Algorithm (provides the
best possible alignment, but slow!)
• Heuristic methods - BLAST and FASTA use fast approximate
methods to align two sequences.
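A local alignment can be sketched with the same dynamic programming table as a global one, with two changes: cell scores never drop below zero, and the traceback starts at the highest-scoring cell rather than the corner. The sketch below assumes match +2, mismatch -1 and a linear gap penalty of -2. Those values, the function name `smith_waterman`, and the flanking residues around DTGA in the example are illustrative assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of 0 lets a new local alignment start anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score returns to 0.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Only the shared DTGA region is reported; the flanks are ignored.
    print(smith_waterman("PPPDTGAQQQ", "WWWWDTGAYY"))
```

The table has one cell per pair of residues, so the cost grows with the product of the two sequence lengths. This is why heuristic methods are preferred for searching whole databases.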
Types of Sequence Alignment
• Local Cont.
• Heuristic algorithms are empirical (use rules of thumb to align)
• Much faster than dynamic programming algorithms, so better
suited for database searches
• Do not guarantee an optimal alignment, unlike dynamic
programming algorithms
Question – You have found a homolog to your unknown gene of interest
using BLAST, what might you do to optimize the alignment?
Alignment Algorithms
• Require a scoring system for evaluating match or mismatch
of 2 characters (aa or nt)
Substitution Matrices
• Likelihood of One Amino Acid Mutated into Another Over
Evolutionary Time
• Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
• Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
• High Score for Identical Matches: Rare Amino Acids (e.g.,
Trp, Cys)
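As a small illustration of how a substitution matrix is used, the sketch below scores an ungapped pairing of residues by looking up each pair in a table. Only the Gly/Trp (-7) and Lys/Arg (+3) entries come from the bullets above. The identity scores are illustrative placeholders chosen to show rare residues scoring higher, and should be checked against a published matrix before any real use. The names `SCORES`, `pair_score` and `score_ungapped` are assumptions.

```python
# Tiny excerpt of a substitution matrix keyed by one-letter residue codes.
# G/W and K/R are the examples from the slide; the identity values are
# illustrative placeholders, not an authoritative matrix.
SCORES = {
    ("G", "W"): -7,
    ("K", "R"): 3,
    ("W", "W"): 17,
    ("C", "C"): 12,
    ("K", "K"): 5,
    ("R", "R"): 6,
    ("G", "G"): 5,
}


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric, so try both orders."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]  # raises KeyError for pairs missing from the excerpt


def score_ungapped(seq_a, seq_b):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(score_ungapped("KWC", "RWC"))  # conservative K/R plus two identities
    print(score_ungapped("GK", "WK"))    # unlikely G/W substitution lowers the score
```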
Scoring Matrices
• There are several of them and the choice can affect the
outcome
• Values proportional to the probability one aa mutates into another
• Can be based on chemical similarity, functional similarity, structure,
evolutionary similarity etc
• Common matrices for protein comparisons
• PAM (Point Accepted Mutation): derived from global alignments of
closely related proteins that are at least 85% identical, and based
on an explicit model of evolution
• PAM250 matrix is widely used
• PAM is not necessarily good for identifying relationships in highly
divergent species. It does not account for conserved blocks or motifs
Scoring Matrices
• BLOSUM Matrices
• Look only for differences in conserved, ungapped regions of a
protein family
• Directly calculated, using no extrapolations
• More sensitive to structural or functional substitutions
• Generally perform better than PAM matrices for local similarity
searches (Henikoff and Henikoff, 1993)
• BLOSUM62
• Every possible identity and substitution is assigned a score based
on the observed frequencies of such occurrences in alignments
of related proteins
• BLOSUM62 is calculated from alignment blocks in which sequences
that are 62% identical or more are clustered and counted as a single
sequence. Other versions, such as BLOSUM45 or BLOSUM80, are also available
• Default for protein BLAST
Scoring Matrices
Alignment Algorithms Require
• A penalty function for gaps in sequences
• A method for finding an optimal pairing of sequences (may introduce
gaps to optimize the score)
• A gap is a space introduced into an alignment to compensate for
insertions and deletions in the sequences being compared
• For each gap introduced there is a penalty, and extending the gap further
increases the penalty
AGGVLIIQVG
llllllxxxx
AGGVLIQVG-
AGGVLIIQVG
lllllxllll
AGGVL-IQVG
AGGVLIIQVG
llllllxlll
AGGVLI-QVG
Score the residue matches and
score the residue gaps
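The three candidate alignments above can be compared by scoring them. The sketch below assumes an affine gap scheme in which the first position of a gap costs `gap_open` and each further position costs `gap_extend`. The values (match +1, mismatch -1, gap open -5, gap extend -1) and the function name `score_alignment` are illustrative assumptions, not defaults of any program.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned strings of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_row = None  # which row had the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is cheaper than opening one.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


if __name__ == "__main__":
    for bottom in ("AGGVLIQVG-", "AGGVL-IQVG", "AGGVLI-QVG"):
        print(bottom, score_alignment("AGGVLIIQVG", bottom))
```

Under these assumed values, the first alignment (gap pushed to the end, three mismatches) scores lower than the other two, which place the gap opposite one of the two adjacent I residues and score the same as each other.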
Heuristic Algorithms
• FASTA: Based on K-Tuples (2-Amino Acid)
• BLAST: Triples of Conserved Amino Acids
• Gapped-BLAST: Allow Gaps in Segment Pairs
• PHI-BLAST: Pattern-Hit Initiated Search
• PSI-BLAST: Position-Specific Iterated Search
What is a heuristic algorithm?
FASTA Algorithm at EBI
http://www.ebi.ac.uk/Tools/fasta/index.html
BLAST Algorithms
BLAST (Basic Local Alignment Search Tool)
– To search a sequence against the database
– Extremely fast
– Robust
– Most widely used
• It finds very short segment pairs between the query and
sequence in the database
• These segments are then extended in both directions until the
maximum possible score of this particular segment is reached
• Available at NCBI, EBI and many other community database
sites
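The seed-and-extend idea described above can be sketched in a few lines. This is a deliberately simplified illustration, not the actual BLAST implementation. It uses exact word matches as seeds, ungapped extension only, and a simple rule that stops extending once the score falls more than `drop_off` below the best seen. The function names, the scoring values and the `drop_off` value are all assumptions.

```python
def _extend(query, subject, qi, sj, step, match, mismatch, drop_off):
    """Walk from (qi, sj) in direction `step`; return (best gain, residues kept)."""
    score = best = 0
    kept = walked = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        walked += 1
        if score > best:
            best, kept = score, walked
        elif best - score > drop_off:
            break  # score has fallen too far below the best; stop extending
        qi += step
        sj += step
    return best, kept


def seed_and_extend(query, subject, word=3, match=1, mismatch=-1, drop_off=2):
    """Return sorted (score, query_start, subject_start, length) segment pairs."""
    # Index every word of the subject so seeds can be found quickly.
    index = {}
    for j in range(len(subject) - word + 1):
        index.setdefault(subject[j:j + word], []).append(j)
    segments = set()
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            left, n_left = _extend(query, subject, i - 1, j - 1, -1,
                                   match, mismatch, drop_off)
            right, n_right = _extend(query, subject, i + word, j + word, 1,
                                     match, mismatch, drop_off)
            segments.add((word * match + left + right,
                          i - n_left, j - n_left, n_left + word + n_right))
    return sorted(segments, reverse=True)


if __name__ == "__main__":
    for segment in seed_and_extend("THEFATCAT", "THEFASTCAT"):
        print(segment)
```

Because the extension here is ungapped, the extra S in the second sequence splits the match into two separate segment pairs. Joining such segments across a gap is what the gapped variants of the heuristic add.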
BLAST
• A BLAST search has five components: query, database,
program, search purpose/goal and results interpretation
o Query: a sequence that you want to find out more
information about
o Database: need to know what databases are available (NCBI,
EBI etc)
o Program: what program to select to meet your specific
purpose
o Interpreting your results, what does it mean?
NCBI Protein Databases
NCBI Nucleotide Databases
BLAST Program Selection
Nucleotide queries
Protein queries
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide
BLAST Program Selection
Specialized queries
Protein queries
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide
Optional Parameters in BLASTN
Basic Optional Parameters in BLASTP
Advanced Blast Parameters
The default parameters are not always the right parameters for
your search; it depends on your question
• G: Cost to open gap: default = 5 for nucleotides/11 for proteins
• E: Cost to extend gap: default = 2 for nucleotides/1 for proteins
• q: Penalty for nucleotide mismatch: default = -3
• r: Reward for nucleotide match: default = 1
• e: Expectation value: default = 10
BLAST Choice of Programs
• MEGABLAST is specifically designed to efficiently find long alignments between very
similar sequences and thus is the best tool to use to find the identical match to your
query sequence.
• Discontiguous MEGABLAST is better at finding nucleotide sequences similar, but not
identical, to your nucleotide query. This program uses a non-contiguous word within a
longer window of template. In coding mode, third base wobbling is taken into
consideration by focusing on finding matches at the first and second codon positions
while ignoring mismatches in the third position. Searching in discontiguous
MEGABLAST is more sensitive and efficient than standard
blastn using the same word size.
BLAST Choice of Programs
• "Search for short nearly exact matches" is useful for primer or short nucleotide searches. Short
sequences (less than 20 bases) will often not find any significant matches to the database
entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons for this are
that the significance threshold governed by the Expect value parameter is set too stringently
and the default word size parameter is set too high.
• You can adjust both the word size and the Expect value on the standard BLAST pages to work
with short sequences.
• A common use of this page is to check the specificity of PCR or hybridization primers. A useful way to check a pair of PCR primers is
to first concatenate them by inserting a string of 20 or more N's in between the two primers, and then search the concatenated pair
as one sequence. Since BLAST looks for local alignments and automatically searches both strands, there is no need to reverse
complement the reverse primer before doing the concatenation or the search.
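The primer-pair trick described above is easy to script. The sketch below builds the concatenated query and formats it as a FASTA record ready to paste into a search form. The function names and the two primer sequences are made up for illustration. The only rule taken from the text is the spacer of 20 or more N's.

```python
def concatenate_primers(forward, reverse, spacer_length=20):
    """Join two primers with a run of N's so they can be searched as one query."""
    if spacer_length < 20:
        raise ValueError("use a spacer of 20 or more N's")
    # The reverse primer is used as given; no reverse complement is needed.
    return forward.upper() + "N" * spacer_length + reverse.upper()


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical primer sequences, for illustration only.
    query = concatenate_primers("ATGGCTAGCTAGGATCCA", "TTAGGCATCGATCGGTAC")
    print(to_fasta("primer_pair_check", query))
```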
BLAST Choice of Programs
• Use the Trace Archive BLAST page to search raw primary sequence trace files.
The sequence data come from a variety of projects and sequencing strategies, including Whole
Genome Shotgun (WGS), BAC end sequencing, and EST sequencing. The trace data are single
pass sequencing reads not trimmed for quality or vector contamination. Their average lengths
are between 500 and 700 bp.
• Standard protein BLAST is designed for protein searches. Standard protein-protein BLAST
(blastp) is used for both identifying a query amino acid sequence and for finding similar
sequences in protein databases. Like other BLAST programs, blastp is designed to find local
regions of similarity. When sequence similarity spans the whole sequence, blastp will also
report a global alignment, which is the preferred result for protein identification purposes.
• PSI-BLAST is designed for more sensitive protein-protein similarity searches. Position-Specific
Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very
distantly related proteins or new members of a protein family. Use PSI-BLAST when your
standard protein-protein BLAST search either failed to find significant hits, or returned hits with
descriptions such as "hypothetical protein" or "similar to...".
Database Search Questions
• What database should I search?
• What kind of sequences should I search with?
• What E-value is significant?
• What can I reliably infer about the function of my sequence
based on homology?
Databases
• Bigger databases have more sequences.
• Bigger databases are also more redundant, which can
skew the statistics.
• Bigger databases are often more poorly annotated (homology
with an "unidentified sequence" doesn't really tell you
much)
• Bigger databases take lots of time to search.
• Smaller databases (like Swiss-Prot) are often better
curated and annotated.
Databases
• Smaller databases are much less redundant.
• Smaller databases can contain phylogenetically relevant
sequences (all plant)
• Smaller databases are much faster to search.
What is a significant E-value?
• For a single search, an E-value of 10^-3 is significant, though
the match is typically quite distant.
• For multiple searches, the E-value cutoff varies according to
the number of searches.
Multiple Sequence Searches
e.g. 15,000 EST query sequences
• A 10^-3 E-value cutoff means that you should expect
one false positive in 1000 searches.
• Thus with 15,000 searches, we should expect 15 false
positives with a cutoff of 10^-3.
• To reduce the chances of identifying a false positive,
set the E-value cutoff lower.
• For 15,000 searches, an E-value cutoff of 10^-5 will
mean that you should expect 0.15 false positives.
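The arithmetic above is simply the number of searches multiplied by the E-value cutoff. A minimal sketch follows, with function names chosen for illustration, that also inverts the calculation to pick a cutoff for a target number of false positives.

```python
import math


def expected_false_positives(n_searches, e_value_cutoff):
    """Expected number of chance hits across all searches at a given cutoff."""
    return n_searches * e_value_cutoff


def cutoff_for(n_searches, max_false_positives):
    """E-value cutoff that keeps the expected false positives at the target."""
    return max_false_positives / n_searches


if __name__ == "__main__":
    print(expected_false_positives(15_000, 1e-3))  # about 15
    print(expected_false_positives(15_000, 1e-5))  # about 0.15
    print(cutoff_for(15_000, 1))                   # cutoff for about one false positive
    assert math.isclose(expected_false_positives(15_000, 1e-3), 15.0)
```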
Multiple Sequence Searches
In general:
• DNA to DNA alignment
For nucleotide sequences at least 100 bp long, if at least 70% of
your nucleotides are identical with your match sequence
then they can be considered to be homologous.
• AA to AA alignment
For amino acid sequences at least 100 aa long, if at least 25% of
your aa are identical with your match sequence then
they can be considered to be homologous.
• Below these values, the alignments are considered to be
in the twilight zone!
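These rules of thumb can be checked on an existing alignment with a few lines of code. The sketch below assumes that percent identity is computed over aligned columns where neither sequence has a gap. Other definitions of the denominator exist, so this choice is an assumption, as are the function names.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned columns in which the residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(a == b for a, b in columns)
    return 100.0 * identical / len(columns)


def meets_rule_of_thumb(aligned_a, aligned_b, is_protein):
    """Apply the 100-residue length and 70% (DNA) / 25% (protein) identity rule."""
    aligned_length = sum(a != "-" and b != "-"
                         for a, b in zip(aligned_a, aligned_b))
    threshold = 25.0 if is_protein else 70.0
    return (aligned_length >= 100
            and percent_identity(aligned_a, aligned_b) >= threshold)


if __name__ == "__main__":
    dna = "ACGT" * 30  # 120 bp toy sequence
    print(percent_identity("ACGT", "ACGA"))
    print(meets_rule_of_thumb(dna, dna, is_protein=False))
    print(meets_rule_of_thumb("ACGT", "ACGT", is_protein=False))  # too short
```

Passing the check only says the alignment is above the rule-of-thumb thresholds. Alignments below them fall in the twilight zone, where identity alone is not enough to decide.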