Download Gapped Blast and PSI

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Extrachromosomal DNA wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Metagenomics wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Deoxyribozyme wikipedia , lookup

United Kingdom National DNA Database wikipedia , lookup

DNA vaccination wikipedia , lookup

Non-coding DNA wikipedia , lookup

Cre-Lox recombination wikipedia , lookup

Expanded genetic code wikipedia , lookup

Microsatellite wikipedia , lookup

Genomics wikipedia , lookup

Genetic code wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Helitron (biology) wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Point mutation wikipedia , lookup

Multiple sequence alignment wikipedia , lookup

Smith–Waterman algorithm wikipedia , lookup

Sequence alignment wikipedia , lookup

Transcript
Gapped Blast and PSI
BLAST
Basic Local Alignment Search Tool
~Sean Boyle
The BLAST Topics
•
•
•
•
•
Exactly What is BLAST?
A Quick Recap of Profiles
A Few Statistics Behind the BLAST Program
The Progression to Gapped BLAST
The advancements in PSI BLAST
Exactly What Is BLAST?
•
•
•
•
•
Blast Programs are used for searching both protein and DNA
databases for sequence similarities.
BLAST programs can compare protein to protein, DNA to DNA,
Protein to DNA, or DNA to protein.
The DNA sequences used in comparison are usually conceptually
transcribed before comparison.
BLAST programs use a threshold value which can be adjusted to
alter speed and probability. A higher value of T will give greater
speed, but also a larger probability of missing weaker similarities.
Can use various substitution matrices such as Blosum(62) or PAM
250.
A Quick Recap of Profiles
•
•
•
•
A sequence profile is a position specific scoring matrix
generated from a group of aligned sequences and a basic
scoring matrix.
A profile will have L rows and 22 columns or vice versa.
Amino acid matrix scores are multiplied by the ratio of that
amino acid in the sequences being compared over the entire
number of amino acid possibilities in the matrix.
A consensus sequence or profile is then derived and used in
future comparisons.
A Few Statistics Used in BLAST
•
•
•
•
•
Firstly we require that the expected score for two random
amino acids ΣPiPjSij to be negative.
Now we can calculate two parameters λ and K. These two
variables allow for a normalized scoring system through the
equation S‘ = (λS – ln K) / (ln 2).
S’ can now be plugged into the equation E = N/2^s’.
E-Value > 0.01 = will return more loosely related similarities.
E-Value <= 1*10^-5 will return more strictly related similarities.
The Progression to Gapped BLAST
•
•
•
•
•
Original BLAST program did not take gaps into account.
BLAST used to look for single alignments of at least length T.
Each positive alignment “hit” was then extended.
Gapped BLAST now allows for two non-overlapping
alignments of length T within distance A of one another.
These alignments “hits” are then extended.
Gapped BLAST allows for gap initiation and extension.
ABCDE
ABCDE
ACD - A–CD–
(Original Blast)
(Gapped Blast)
PSI BLAST
•
•
•
•
Position-Specific Iterated BLAST
Incorporates position specific matrices “profiles”
Often much better at detecting weak similarities
Before PSI BLAST the same techniques were
used, but a large degree of expertise and human
intervention was required
Score Matrix Architecture
•
Profiles very similar to scoring matrix
–
Protein or nucleotide aligns to profile position
–
New profile created with every iteration
•
•
•
Gap costs may be position-specific with profiles.
How position specific protein score matrices draw their power
–
–
•
Profiles created in turn i used in turn i+1
Improved estimation of the probabilities with which amino acids
occur at various pattern positions
Relatively precise definition of the boundaries of important motifs
Every matrix constructed has a length exactly the same as the
original query sequence
Multiple Alignment Construction &
Sequence Weights
•
All database sequences whose aligned E-value is below a
specific threshold are added to the query
Any row (or column) which is >= 98% identical to a previously
added alignment is kept out of the profile
•
–
•
•
Allows for better searching on later iterations
Poor restrictions could lead to large scale profile sequence
insertion
Sequences are given different weights depending on
evolutionary importance
PSI BLAST Overview
•
Start off with query and initial score matrix (BLOSUM 62)
–
–
•
A profile(p1) is constructed from the passing sequences and
score matrix
–
–
•
Homologs are found using BLAST (align DB to query)
E-Value is used as criteria for sequence insertion into profile
Once again search for homologs using BLAST(align DB to profile)
Once again use E-Value as criteria for insertion into profile
A profile(p2) is constructed from the approved sequences and
score matirx.