GCG/SeqLab Course: MULTIPLE COMPARISON

DNA/Protein Sequence Analysis: Multiple Comparison
PILEUP
PileUp creates a multiple sequence alignment from a group of related sequences using
progressive, pairwise alignments. It can also plot a tree (dendrogram) showing the
clustering relationships used to create the alignment.
PileUp creates a multiple sequence alignment using a simplification of the progressive alignment
method of Feng and Doolittle (Journal of Molecular Evolution 25; 351-360 (1987)). The method
used is similar to the method described by Higgins and Sharp (CABIOS 5; 151-153 (1989)).
The multiple alignment procedure begins with the pairwise alignment of the two most similar
sequences, producing a cluster of two aligned sequences. This cluster can then be aligned to the
next most related sequence or cluster of aligned sequences. Two clusters of sequences can be
aligned by a simple extension of the pairwise alignment of two individual sequences. The final
alignment is achieved by a series of progressive, pairwise alignments that include increasingly
dissimilar sequences and clusters, until all sequences have been included in the final pairwise
alignment.
Before alignment, the sequences are first clustered by similarity to produce a dendrogram, or tree
representation of clustering relationships. It is this dendrogram that directs the order of the
subsequent pairwise alignments. PileUp can plot this dendrogram so that you can see the order of
the pairwise alignments that created the final alignment.
As a general rule, PileUp can align up to 500 sequences, with any single sequence in the final
alignment restricted to a maximum length of 7,000 characters (including gap characters inserted
into the sequence by PileUp to create the alignment). However, if you include long sequences in
the alignment, the number of sequences PileUp can align decreases.
Screen Monitoring:
PileUp names each sequence to be aligned as it is read in. It then displays the message
Determining pairwise similarity scores... and shows a quality ratio for every pairwise alignment.
This ratio is the alignment's quality divided by the length of the shorter sequence. If x is the
number of sequences to be aligned, there are (x(x-1))/2 pairwise alignments whose ratio must be
calculated.
Next PileUp displays the message Aligning... as it performs each of the pairwise alignments that
together create the final multiple sequence alignment. There are x-1 alignments in this part of the
program.
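As a quick check of the two counts above, here is a minimal Python sketch. The function name is illustrative and is not part of PileUp.

```python
def pileup_alignment_counts(x):
    """Return (pairwise score alignments, progressive alignments) for x sequences."""
    if x < 2:
        raise ValueError("PileUp needs at least two sequences")
    pairwise = x * (x - 1) // 2   # every possible pair is scored once
    progressive = x - 1           # each progressive alignment merges two clusters
    return pairwise, progressive

print(pileup_alignment_counts(8))   # (28, 7) for eight sequences
```

The first count grows quadratically with the number of sequences, which is why the similarity-scoring phase dominates the run time for large inputs.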
Input Files:
PileUp accepts multiple (two or more) nucleotide sequences or multiple (two or more) protein
sequences as input. You can specify multiple sequences in a number of ways: by using a list file,
for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using
a sequence specification with an asterisk (*) wildcard, for example *.pep. The function of PileUp
depends on whether your input sequence(s) are protein or nucleotide. Programs determine the
type of a sequence by the presence of either Type: N or Type: P on the last line of the text
heading just above the sequence.
Restrictions:
PileUp restricts each sequence in the final alignment to a maximum length of 7,000 characters.
This maximum length includes the input sequence length plus the total length of all gap
characters inserted into the sequence to create the final alignment. By default, each input
sequence is restricted to a maximum length of 5,000. Also by default, PileUp can add a
maximum of 2,000 gap characters for each sequence in the final alignment.
If you wish to align longer sequences, then you can specify a maximum sequence length of up to
7,000 bp. If you increase the maximum sequence length in this way, then the maximum amount
of allowed gapping is automatically reduced so that the final aligned sequence length cannot
exceed 7,000 for any sequence.
If you wish to allow for more gapping in the final alignment, then you can specify a maximum
number of gap characters for each sequence. If you increase the maximum amount of gapping
permitted for each sequence in this way, the maximum sequence length is automatically
decreased so that the final aligned sequence length cannot exceed 7,000 for any sequence.
The total length of all of the sequences read into PileUp (including the gap allowance for each
sequence) cannot be greater than 2,000,000. By reducing the gap allowance for each sequence
you can increase the number of sequences that can be read into the program up to the maximum
of 500 sequences.
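The trade-off between sequence length, gap allowance, and number of sequences can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and it assumes that each sequence counts toward the 2,000,000 total as its own length plus the full gap allowance, which the text above does not spell out.

```python
MAX_ALIGNED_LENGTH = 7000      # longest aligned sequence, gaps included
MAX_TOTAL_LENGTH = 2_000_000   # all sequences plus their gap allowances
MAX_SEQUENCES = 500

def gap_allowance(max_seq_length=5000):
    """Gap characters left per sequence once the maximum input length is chosen."""
    if not 0 < max_seq_length <= MAX_ALIGNED_LENGTH:
        raise ValueError("maximum sequence length must be between 1 and 7,000")
    return MAX_ALIGNED_LENGTH - max_seq_length

def fits(seq_lengths, max_gaps=2000):
    """Check a set of input lengths against the limits described above.
    Assumption: each sequence counts as its own length plus the full gap allowance."""
    if len(seq_lengths) > MAX_SEQUENCES:
        return False
    if any(n + max_gaps > MAX_ALIGNED_LENGTH for n in seq_lengths):
        return False
    return sum(n + max_gaps for n in seq_lengths) <= MAX_TOTAL_LENGTH

print(gap_allowance())          # 2000 with the default maximum length of 5,000
print(fits([300] * 500))        # True: short sequences leave room for all 500
print(fits([5000] * 500))       # False: the total would exceed 2,000,000
```

Under this assumption, lowering the gap allowance is what lets a larger set of long sequences fit within the total.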
Algorithm:
A rigorously optimal alignment of even a small number of short sequences would be intractable,
both in terms of memory and time. Therefore, PileUp does a series of progressive, pairwise
alignments between sequences and clusters of sequences to generate the final alignment. A
cluster consists of two or more already-aligned sequences.
PileUp begins by doing pairwise alignments that score the similarity between every possible pair
of sequences. These similarity scores are used to create a clustering order that can be represented
as a dendrogram. The clustering strategy represented by the dendrogram is called UPGMA, which
stands for unweighted pair-group method using arithmetic averages (Sneath, P.H.A. and Sokal,
R.R. (1973) in Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San
Francisco, California, USA).
The dendrogram shows the order of the pairwise alignments of sequences and clusters of
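sequences that together generate the final alignment. For example:
[Figure: dendrogram in which Seq1 joins Seq2, Seq3 joins Seq4, the two clusters then join each other, and Seq5 joins last]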
PileUp uses this clustering order and first aligns the two most-related sequences to each other in
order to produce the first cluster. It then aligns the next most related sequence to this cluster or
the next two most-related sequences to each other in order to produce another cluster. A series of
such pairwise alignments that includes increasingly dissimilar sequences and clusters of
sequences at each iteration produces the final alignment.
In the above example, Seq1 and Seq2 are aligned first. Next, Seq3 and Seq4 are aligned. The
cluster of Seq1-aligned-to-Seq2 is then aligned to the cluster of Seq3-aligned-to-Seq4. Finally,
Seq5 is aligned to the cluster that now contains Seq1 through Seq4 to generate the final
alignment of Seq1 through Seq5.
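The clustering order in this example can be reproduced with a small average-linkage (UPGMA-style) sketch in Python. The similarity values below are hypothetical, chosen only so that the merge order matches the one described; the function names are illustrative and this is not PileUp's actual code.

```python
from itertools import combinations

def average_similarity(a, b, sim):
    """UPGMA linkage: mean similarity over all member pairs of clusters a and b."""
    return sum(sim[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

def upgma_order(names, sim):
    """Return the merges, in order, as (cluster_a, cluster_b) tuples of names."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(pair[0], pair[1], sim))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical pairwise similarity scores, chosen only for illustration.
names = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]
sim = {frozenset(p): 30 for p in combinations(names, 2)}   # Seq5 is distant from all
sim[frozenset(("Seq1", "Seq2"))] = 90
sim[frozenset(("Seq3", "Seq4"))] = 80
for p in [("Seq1", "Seq3"), ("Seq1", "Seq4"), ("Seq2", "Seq3"), ("Seq2", "Seq4")]:
    sim[frozenset(p)] = 60

for a, b in upgma_order(names, sim):
    print(" + ".join(a), "<->", " + ".join(b))
```

Each printed line corresponds to one of the x-1 progressive alignments, listed in the order in which they would be performed.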
Each pairwise alignment in PileUp uses the method of Needleman and Wunsch (Journal of
Molecular Biology 48; 443-453 (1970)), extended for use with clusters of aligned
sequences rather than only individual sequences. For a pairwise alignment of individual
sequences, the comparison score between any two sequence symbols is found in a scoring matrix.
For a pairwise alignment of clusters of sequences, the comparison score between any two
positions in those clusters is simply the arithmetic average of the scores for all possible symbol
comparisons at those positions. When gaps are inserted into a cluster to produce an alignment,
they are inserted at the same position in all of the sequences of the cluster.
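The averaging rule for clusters can be written down in a few lines of Python. The scoring matrix here is a toy identity-style matrix invented for the illustration, not one of the GCG matrices, and the sketch leaves out how gap characters already present in a cluster are scored.

```python
def column_pair_score(col_a, col_b, matrix):
    """Average score over all symbol comparisons between two alignment columns.
    col_a and col_b hold one symbol per sequence in each cluster."""
    total = sum(matrix[(a, b)] for a in col_a for b in col_b)
    return total / (len(col_a) * len(col_b))

# Toy identity-style matrix (illustrative values, not a GCG scoring matrix).
symbols = "ACGT"
matrix = {(a, b): (10 if a == b else 0) for a in symbols for b in symbols}

print(column_pair_score("CG", "C", matrix))    # 5.0
print(column_pair_score("AA", "AT", matrix))   # 5.0
```

With a single symbol on each side, the function reduces to an ordinary scoring matrix lookup, which is the individual-sequence case described above.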
Because a rigorously optimal alignment of even a small number of short sequences would be
intractable, PileUp uses an approach that may not produce the optimal multiple sequence
alignment.
Clustering
The approach used by PileUp is sensitive to the order in which sequences are aligned. A
clustering algorithm determines this order from the pairwise similarities calculated before the
final alignments are done. The goal of the clustering is to see that very similar sequences are
aligned to each other before they are aligned to more distantly related sequences. There is, at
present, no way for you to modify the order of these alignments.
While PileUp calculates the similarity between each of the sequences, this information is not
used by the program to weight the sequences. That is, if there are several very similar sequences,
the final alignment may be constrained to minimize the disruption of these sequences.
The dendrogram is not a phylogenetic reconstruction, although the vertical branch lengths are
proportional to the distance between the sequences. Its purpose is to represent the clustering
order used to create the final alignment. This order is the only information from the dendrogram
used by PileUp. See the RELATED PROGRAMS topic for a description of programs in the
Wisconsin Package that you can use to create phylogenetic reconstructions from multiple
sequence alignments.
Global Alignment
If you know the difference between Gap and BestFit, consider PileUp an extension of the Gap
program for more than two sequences, rather than an extension of the BestFit program. PileUp,
like Gap, tries to find a global optimal alignment, while BestFit finds a local optimal alignment.
Because PileUp aligns sequences along their entire lengths, it is not ideally suited to finding the
best local region of similarity (such as a shared motif) among all of the sequences. However,
PileUp has been used successfully for this purpose.
By default, PileUp does not penalize gaps occurring at the ends of sequences. Therefore, related
sequences that differ in the extent of their sequencing can be reasonably aligned by PileUp. You
can override this default by selecting -ENDWeight, in which case length differences among the
sequences become significant.
Piling Up Unrelated Sequences
PileUp always aligns all of the sequences you specify, even if they are not related. The alignment
can be degraded if some of the sequences are not similar to one another.
Arbitrary Gap Placement
In any pairwise alignment, the position of the inserted gaps may be arbitrary; equally optimal
alignments can be generated by inserting the gaps differently. PileUp can exaggerate these
arbitrary differences if you select either the -LOWroad or the -HIGhroad parameter. This selection
usually affects the final alignment. For the most part, however, the difference between the high
road and low road alignments should not be very significant, although you may want to check.
Here is an example showing the difference between high and low road for the alignment of three
short sequences. The first pairwise alignment creates an aligned cluster of the two most closely
related sequences; the second alignment aligns this cluster to the third sequence creating the final
multiple sequence alignment. Although the qualities after the first round alignments are the
same, the quality of the final low-road alignment is higher than the high-road one.
For:  Match = 10   Mismatch = 0   Gap weight = 10   Length weight = 0

               HighRoad        LowRoad
Alignment 1    GACCAT          GACCAT
               GAG.AT          GA.GAT
               Quality = 30    Quality = 30

Alignment 2    GACC.AT         GAC.CAT
               GAG..AT         GA..GAT
               AACGGAT         AACGGAT
               Quality = 25    Quality = 30
High road alignments shift all of the arbitrary gaps in the second sequence or cluster of aligned
sequences to the right and all of the arbitrary gaps in the first sequence or cluster of aligned
sequences to the left. Low road alignments do the opposite. When neither high road nor low
road is selected, the program tries not to insert a gap whenever that is possible and uses the high
road when that is not possible.
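The quality values in the high road and low road example can be checked with a short Python sketch. It uses the weights given for the example (match 10, mismatch 0, gap weight 10, length weight 0) and two assumptions that are not stated in the text: a column that is entirely gaps on one side is treated as a gap opened by the current alignment, and a gap character left over from an earlier round scores 0 in the column average.

```python
GAP = "."

def quality(cluster, seq, match=10, mismatch=0, gap_weight=10, length_weight=0):
    """Quality of an alignment between an aligned cluster (list of rows) and one
    aligned sequence, all of equal length.
    Assumptions: a column that is all gaps on one side is a gap opened by this
    alignment; a gap character left over from an earlier round scores 0."""
    if any(len(row) != len(seq) for row in cluster):
        raise ValueError("all rows must have the same aligned length")
    total = 0.0
    in_gap = False
    for i, symbol in enumerate(seq):
        column = [row[i] for row in cluster]
        if symbol == GAP or all(c == GAP for c in column):
            if not in_gap:
                total -= gap_weight      # charged once per gap
            total -= length_weight       # charged per gap position
            in_gap = True
            continue
        in_gap = False
        scores = [0 if c == GAP else (match if c == symbol else mismatch) for c in column]
        total += sum(scores) / len(scores)   # arithmetic average over the cluster
    return total

third = "AACGGAT"
print(quality(["GAG.AT"], "GACCAT"), quality(["GA.GAT"], "GACCAT"))   # 30.0 30.0
print(quality(["GACC.AT", "GAG..AT"], third))   # 25.0 (high road)
print(quality(["GAC.CAT", "GA..GAT"], third))   # 30.0 (low road)
```

The two first-round alignments score the same, but the low road placement leaves a C in the third column that half-matches the third sequence, which accounts for the 5-point difference in the final alignment.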
Scoring Matrices
The default scoring matrices are not necessarily appropriate for all alignments. Several
alternative scoring matrices suitable for multiple sequence alignments are provided. PileUp
chooses default gap creation and extension penalties that are appropriate for the scoring matrix it
reads. If you select a different scoring matrix, the program will adjust the default gap penalties
accordingly.
The following exercise will use several sequences that you will need to transfer to your
main list. There are two types of sequences: nucleic acid and peptide. Please place all these
sequences into the list you created for this course.
From the UniProt database:
P17538
P15157
P40313
P08218
P00750
P07477
P03951
Q04756
From the GenBank:primate database:
M24400
BC005385
BT007356
BC063475
The first PileUp that we will perform is a protein/peptide multiple sequence alignment.
Select the protein/peptide sequences from UniProt (there are 8 sequences). Move your
cursor to Functions and select PileUp from the Multiple Comparison Menu.
The following screen will appear. Click on the Options button to display the options menu.
For this exercise, select “don’t penalize gaps at the ends...”, “select top alignment...”,
“sequence ordered by similarity....”, and “Plot dendrogram....”. Close the Options window
and Select Run from the PileUp Main window.
To view the output of this alignment, click here. (link to pileup_output.txt)
The next figure provides the dendrogram of the PileUp alignment. This is not a phylogenetic
tree, only a representation of the pairwise comparison used to create the multiple sequence
alignment.
Close the dendrogram window and go to the Output Manager. Select the “.msf” file and
add this file to the Main Window. This will load the sequences into your temporary list,
which we will use later.
Your Main Window should now contain these sequences.
To become a little more familiar with the PileUp program, select the nucleic acid sequences
that are in your main list and run the PileUp program. From the Options Menu, select
options that we did not use for the peptide alignment. After you have briefly reviewed the
results, be sure to add the “.msf” file from this alignment to your Main List.
PlotSimilarity
PlotSimilarity calculates the average similarity among all members of a group of aligned
sequences at each position in the alignment, using a user-specified sliding window of
comparison. The window of comparison is moved along all sequences, one position at a time,
and the average similarity over the entire window is plotted at the middle position of the window.
The average similarity across the entire alignment is plotted as a dotted line.
If you give PlotSimilarity a single input sequence, you can choose the range and strand for that
sequence, and then PlotSimilarity prompts you for the name, range, and strand of a second input
sequence. In this way, you can plot the average similarity between two aligned sequences
taken from Gap output files.
PlotSimilarity accepts multiple (two or more) aligned nucleotide sequences or aligned protein
sequences as input. The multiple sequence alignment created by the PileUp program can be used
as input to PlotSimilarity. The gapped output files from the Gap and BestFit programs, which
were created using the Options Menu, can also be used as input to PlotSimilarity. If the first
sequence entered into PlotSimilarity is a single sequence, the program prompts you for the
second sequence.
Algorithm:
The average similarity at a position in an alignment is the arithmetic average of the scores of all
possible pairwise symbol comparisons among the sequence symbols at that position. The
comparison score between any two sequence symbols is the comparison value between those
symbols in the scoring matrix multiplied by the weight of each of the two sequences. The average
similarity across the entire alignment (plotted as a dotted line) is the sum of the separate window
similarities divided by the number of windows.
If “plot the level of identity....” is selected, the program plots a measure of the level of identity
among all sequences in the multiple sequence alignment. The calculations are done exactly as
described above, but all identical symbol comparisons are given a value of 1; all other
comparisons are given a value of 0.
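The calculations described above can be sketched in Python. This is a simplified illustration rather than PlotSimilarity's own code: it assumes every sequence has a weight of 1.0, uses identity scoring by default, ignores gap characters, needs at least two sequences, and uses toy sequences and function names invented for the example.

```python
def position_similarity(column, score):
    """Average score over all pairwise symbol comparisons within one column."""
    n = len(column)
    pairs = [(column[i], column[j]) for i in range(n) for j in range(i + 1, n)]
    return sum(score(a, b) for a, b in pairs) / len(pairs)

def identity(a, b):
    """Identity scoring: 1 for identical symbols, 0 otherwise."""
    return 1 if a == b else 0

def windowed_similarity(rows, window, score=identity):
    """Average similarity for each full window, reported at the window's middle position."""
    length = len(rows[0])
    per_position = [position_similarity([r[i] for r in rows], score) for i in range(length)]
    result = {}
    for start in range(length - window + 1):
        middle = start + window // 2
        result[middle] = sum(per_position[start:start + window]) / window
    return result

rows = ["GACCAT", "GACGAT", "GATCAT"]   # toy, already-aligned sequences
curve = windowed_similarity(rows, window=3)
overall = sum(curve.values()) / len(curve)   # the dotted line: mean of the window values
print(curve)
print(overall)
```

Passing a function that looks values up in a scoring matrix as the score argument gives the similarity plot instead of the identity plot.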
If -PROFile is selected, the program plots a running average of the positional conservation in a
profile. The measure of conservation at any position is the difference between the greatest and
least values at that position in the profile. The profile is created in a program called ProfileMake.
This gives a result very similar to selecting “Include the plot of overall similarity”, which
does not require a profile to be created.
The PlotSimilarity program provides a graphical representation of a multiple sequence alignment
or two sequences generated by BestFit or GAP (you must use the individual sequences generated
from the Pairwise Sequence Analysis programs). For this exercise, select the “.msf” file from
the peptide alignment, move your cursor to Functions and select PlotSimilarity from the
Multiple Comparison Menu.
Next, click on the Options button to enter the Options menu.
From the Options Menu, select “continuous curve”, “Include the plot of overall similarity”,
and “minimum and maximum values calculated.....”. Close the Options Menu and select
Run from the PlotSimilarity window.
The following page will be displayed that contains a graph of the similar regions identified by the
PileUp program.
PRETTY
Pretty displays multiple sequence alignments and calculates a consensus sequence. It does
not create the alignment; it simply displays it.
Pretty prints sequences with their columns aligned and can display a consensus for the alignment,
allowing you to look at relationships among the sequences. This program can be used for aligned
sequences in an MSF (multiple sequence format) or RSF (rich sequence format) file, or for
separate sequences that have had gaps added to make them all align.
Pretty accepts multiple (one or more) aligned nucleotide sequences or aligned protein sequences
as input. You can specify an MSF file, such as the output file from a session with PileUp
(for example pileup.msf{*}), as input to Pretty. Weights can be specified for sequences in MSF files.
(See the Vote Weight discussion below.)
Weighting Sequences (Vote Weight):
If several of your sequences are very similar, you may not want their votes to dominate the
consensus for the column. The vote weight is the vote that each row casts for the consensus. A
weight of 1.0 is assumed if no vote weight is specified.
You can assign vote weights to sequences in an MSF file by editing the MSF file and modifying
the weight on the name/weight line for each sequence at the top of the file.
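The effect of vote weights can be illustrated with a weighted plurality vote in Python. This is a deliberately simplified sketch: Pretty's actual consensus rules are not described in this section, and the function name, sequences, and weights below are made up for the example.

```python
from collections import defaultdict

def weighted_consensus(rows, weights=None):
    """Pick, for each column, the symbol with the largest total vote weight.
    Simplification: plain weighted plurality; ties and gaps are not treated specially."""
    if weights is None:
        weights = [1.0] * len(rows)       # default vote weight
    consensus = []
    for column in zip(*rows):
        votes = defaultdict(float)
        for symbol, weight in zip(column, weights):
            votes[symbol] += weight
        consensus.append(max(votes, key=votes.get))
    return "".join(consensus)

rows = ["GACCAT", "GACCAT", "GACCAT", "GATGAT"]   # three near-duplicates and one outlier
print(weighted_consensus(rows))                         # GACCAT
print(weighted_consensus(rows, [0.3, 0.3, 0.3, 1.0]))   # GATGAT
```

With equal weights the three identical sequences outvote the fourth; reducing their weights lets the fourth sequence decide the columns where it differs.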
For this exercise, we will use the “.msf” file that you created from the nucleic acid
alignment. Select the file, move your cursor to the Functions menu, and select PRETTY
from the Multiple Comparison Menu.
From the Main PRETTY window click on the Options button.
From the Options Menu, select “Display consensus sequence”, and “show positions
agreeing....”. Close the Options Menu and select Run from the main PRETTY window.
To view an example PRETTY output file, click here. (link to pretty_output.txt)
ProfileMake
ProfileMake creates a position-specific scoring table, called a profile, that quantitatively
represents the information from a group of aligned sequences. The profile can then be used for
database searching (ProfileSearch) or sequence alignment (ProfileGap).
ProfileMake uses the method of Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358
(1987)) to create a profile from a group of aligned sequences. A profile is a table that contains all
of the comparison information of a group of aligned sequences. These sequences must be
aligned before you run ProfileMake. The profile contains as many rows as there are
positions in the aligned sequences. Each row contains a score for the alignment of the
corresponding position of the aligned sequences with each possible base or residue.
The profile is the input data for ProfileSearch, which can find sequences in the database similar
to your group of aligned sequences, and ProfileGap, which can make an optimal alignment
between the aligned sequences and another sequence.
ProfileMake accepts multiple sequences (two or more) all of the same type. You can specify
multiple sequences by using an MSF file, for example project.msf{*}; or by using a sequence
specification with an asterisk (*) wildcard, for example *.pep. The function of ProfileMake
depends on whether your input sequence(s) are protein or nucleotide. Programs determine the
type of a sequence by the presence of either Type: N or Type: P on the last line of the text
heading just above the sequence.
ProfileMake makes a profile from a multiple sequence alignment. ProfileSearch uses the profile
to search a database for sequences with similarity to the group of aligned sequences.
ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output
list and the group of aligned sequences (represented by the profile consensus). ProfileGap makes
optimal alignments between one or more sequences and a group of aligned sequences represented
as a profile. ProfileScan finds structural and sequence motifs in protein sequences, using
predetermined parameters to determine significance.
Algorithm:
Similarity Scores
In a scoring matrix, a score can be found for the comparison of any two sequence symbols.
Given a group of aligned sequences, a score can be calculated for the comparison of a symbol to
each position of the aligned sequences. This comparison score differs from position to position
in the aligned sequences, because each position contains a different spectrum of sequence
symbols. The overall score is, in a sense, the average of the comparison scores for the sequence
symbols found at a particular aligned sequence position.
Each row of a profile contains the scores for a comparison of the corresponding position of a
multiple sequence alignment to each possible sequence symbol. For example, if a profile is made
from a group of aligned protein sequences, the 10th row of the profile has values for the
comparison of the 10th position in the alignment to each possible amino acid. The profile has as
many rows as there are positions in the alignment, and each row has as many comparison scores
as there are amino acid symbols. Thus, the profile is a position-specific scoring matrix for every
position in a multiple sequence alignment.
The consensus sequence character is the symbol with the largest value in each row of the profile.
It is used solely for the display of alignments and not for the calculation of the optimal alignment
between a profile and a sequence.
The last row of the profile contains the composition for the whole profile. In the A column, for
instance, the total number of A's in the multiple sequence alignment is shown.
Sequence Symbol Weights
As stated above, the comparison score of an alignment position and a given sequence symbol is
an average of the comparison scores for the different sequence symbols at that position. This
average is weighted so that a symbol's weight in the calculation of the average score increases
along with its fraction of the symbols at that position. Two types of weighting are currently used.
Linear weighting gives a weight to each symbol that is directly proportional to the number of
occurrences of that symbol at a given position. The default logarithmic weighting gives a symbol
that predominates at a given position a disproportionately higher weight than a symbol that
occurs only once. This causes positions in the aligned sequences that have many identical
residues to bias the profile more strongly towards the identical residues than when linear
weighting is used.
Using either kind of weighting, the weight for a residue is 0 when that residue does not occur at a
given position; the weight is 1 when only that residue is found at a given position.
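A single profile row under each weighting scheme can be sketched in Python. The linear weights follow directly from the description above. The logarithmic expression is one form that has the stated end points (0 for an absent residue, 1 for a residue that fills the column); whether ProfileMake uses exactly this expression is an assumption. The scoring matrix is a toy identity-style matrix and the function names are illustrative.

```python
import math
from collections import Counter

def linear_weights(column):
    """Weight of each symbol = its fraction of the symbols in the column."""
    counts = Counter(column)
    return {s: n / len(column) for s, n in counts.items()}

def log_weights(column):
    """One logarithmic form with the same end points (0 when absent, 1 when alone).
    Assumption: used for illustration; not confirmed as ProfileMake's exact expression."""
    n_seqs = len(column)
    counts = Counter(column)
    return {s: math.log(1 - n / (n_seqs + 1)) / math.log(1 / (n_seqs + 1))
            for s, n in counts.items()}

def profile_row(column, matrix, alphabet, weights=linear_weights):
    """Score of every possible symbol against one alignment column."""
    w = weights(column)
    return {a: sum(w[b] * matrix[(a, b)] for b in w) for a in alphabet}

alphabet = "ACGT"
matrix = {(a, b): (10 if a == b else 0) for a in alphabet for b in alphabet}  # toy matrix
row = profile_row("AAAC", matrix, alphabet)
print(row)                        # A scores highest
print(max(row, key=row.get))      # consensus character for this row: A
print(log_weights("AAAC"))        # A outweighs C by more than 3 to 1
```

In the column AAAC, linear weighting gives A three times the weight of C, while the logarithmic form gives it roughly four times the weight, which is the disproportionate emphasis on the predominant residue described above.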
If the number of aligned sequences is fairly small, the sequence symbols observed at each
position of the alignment may not represent the whole spectrum of symbols that would be
observed if more sequences were available. In these cases, even residues that are not observed at
a given position in the alignment should perhaps be given a small weight. For nucleic acids,
non-observed bases are given a weight of 0 by default. The default for proteins is to give
non-observed amino acids a weight equal to 0.025 divided by the sum of the sequence weights. The
-STRINgent command-line parameter gives non-observed sequence symbols a weight of 0.
Gap Coefficients
The profile also includes position-specific gap coefficients, expressed as percentages. The gap
coefficient determines the penalty that an alignment must pay in order to create a gap, and the
gap length coefficient determines the penalty that must be paid in order to extend a gap. The
actual gap penalties are calculated by multiplying the position-specific gap coefficients by the
gap penalties specified when running the other Profile programs.
All gaps in the aligned sequences that overlap are treated as a single gap for purposes of
calculating gap coefficients. The gap is considered to begin at the position of the leftmost gap
character (. or ~) in any of the sequences, and to end at the rightmost gap character. The
position-specific gap coefficients are reduced from 100 percent as a function of the longest gap
through the position of interest in the aligned sequences. The gap coefficient G and gap length
coefficient L are calculated as:
G = C(G) x ( R(G) / ( 1 + GapLength x R(L) ) )
L = C(G) x ( R(G) / ( 1 + GapLength x R(L) ) )
Where GapLength is the length of the gap as defined above. GapCoefficient (C(G)), GapRatio
(R(G)), and GapLengthRatio (R(L)) have default values of 100, 0.33, and 0.1 respectively, but
can be changed by optional parameters entered on the command line (see the COMMAND LINE
SUMMARY topic below).
You can edit the profile with a text editor and change the gap coefficients to any values you wish.
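To get a feel for the numbers, the expression above can be evaluated with the default values in Python. The function name is illustrative, and the treatment of positions that no gap passes through (they keep the full coefficient) is an assumption based on the statement that coefficients are reduced from 100 percent.

```python
def gap_coefficient(gap_length, coefficient=100.0, gap_ratio=0.33, gap_length_ratio=0.1):
    """Position-specific gap coefficient (percent) for the longest gap through a position.
    Assumption: a position that no gap passes through keeps the full coefficient."""
    if gap_length < 0:
        raise ValueError("gap length cannot be negative")
    if gap_length == 0:
        return coefficient
    return coefficient * gap_ratio / (1 + gap_length * gap_length_ratio)

for length in (0, 1, 5, 10):
    print(length, round(gap_coefficient(length), 1))
```

With the defaults, a gap of length 1 lowers the coefficient to about 30 percent and a gap of length 10 to about 16.5 percent, so longer observed gaps make new gaps at that position progressively cheaper.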
For this exercise we will use the nucleic acid sequence alignment that you created using
PileUp. Select the “.msf” file from your Main List, move your cursor to Functions and
select ProfileMake from the Multiple Comparison Menu.
From the ProfileMake Window, click on the Options Button.
From the options Menu, select “exponential weighting” and “give a weight of 0”. Close the
Options Menu and select RUN from the ProfileMake window.
To view a sample ProfileMake output file, click here. (link to profilemake_output.txt)
ProfileGap
ProfileGap makes an optimal alignment between a profile and one or more sequences.
Profile analysis is a sequence comparison method for finding and aligning distantly
related sequences. The comparison allows a new sequence to be aligned optimally to a family of
similar sequences. The comparison uses a scoring matrix (a derivative of the Dayhoff
evolutionary distances table or PAM matrix) and an existing optimal alignment of two or more
similar protein sequences. The group or "family" of similar sequences is first aligned together to
create a multiple sequence alignment. The information in the multiple sequence alignment is then
represented quantitatively as a table of position-specific symbol comparison values and gap
penalties. This table is called a profile.
The similarity of new sequences to an existing profile can be tested by comparing each
new sequence to the profile with the same algorithm used to make optimal alignments. To
understand how this is done we must first recall what alignment algorithms do. Alignment
algorithms find alignments between two sequences that maximize the number of matches and
minimize the number of gaps. The match, for any pair of symbols being compared, is really a
value that comes from a scoring matrix that contains a value for every possible pair of sequence
symbols. Gaps are given penalties in the same units as the values in the scoring matrix. The best
alignment is then simply defined as the alignment for which the sum of the scoring matrix values
minus the gap penalties is maximal.
So how does alignment work when a sequence is being aligned to a profile? Each row in
the profile corresponds to a position in the original multiple sequence alignment. Each possible
sequence symbol has a value (a column) in each row of the profile. The comparison of a
sequence symbol to any row of the profile defines a specific value or "profile comparison value."
The best alignments of a sequence to a profile are found by aligning the symbols of the sequence
to the profile in such a way that the sum of the profile comparison values minus the gap penalties
is maximal. The profile also contains gap coefficients that are specific for each position so the
penalty for inserting a gap in one part of the alignment might be more or less than in another part.
The position-specific gap coefficients penalize gaps in conserved regions more heavily than gaps
in more variable regions.
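The idea of maximizing profile comparison values minus position-specific gap penalties can be sketched as a small dynamic programming routine in Python. This is a simplified illustration, not ProfileGap's algorithm: it returns only the best global score, charges one penalty per gapped position with no separate gap length penalty, and prices an extra sequence symbol using the coefficient of the nearest profile row. The profile values, gap coefficients, and function name are hypothetical.

```python
def profile_alignment_score(profile, gap_percent, seq, gap_penalty=4.0):
    """Best global score for aligning seq to a profile (score only, no traceback).
    profile:     one dict per alignment position, mapping symbol -> comparison value
    gap_percent: position-specific gap coefficient (percent) for each profile row
    Simplification: each gapped position costs gap_penalty * coefficient / 100."""
    n, m = len(profile), len(seq)
    cost = [gap_penalty * g / 100.0 for g in gap_percent]
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:      # symbol j aligned to profile row i
                best[i][j] = max(best[i][j],
                                 best[i - 1][j - 1] + profile[i - 1].get(seq[j - 1], 0.0))
            if i > 0:                # profile row i skipped (gap in the sequence)
                best[i][j] = max(best[i][j], best[i - 1][j] - cost[i - 1])
            if j > 0:                # extra symbol (gap in the profile), priced at the nearest row
                best[i][j] = max(best[i][j], best[i][j - 1] - cost[min(i, n - 1)])
    return best[n][m]

# Hypothetical three-position profile; the middle position tolerates gaps.
profile = [{"A": 10, "C": 0}, {"A": 0, "C": 10}, {"A": 10, "C": 0}]
gap_percent = [100, 30, 100]
print(profile_alignment_score(profile, gap_percent, "ACA"))   # 30.0, no gaps needed
print(profile_alignment_score(profile, gap_percent, "AA"))    # best score skips the middle row
```

For the sequence AA, the best alignment places the gap at the middle row, where the coefficient is only 30 percent, rather than at either conserved end.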
The profile contains a consensus sequence for the display of alignments of other
sequences to the profile. The consensus sequence character corresponds to the highest value in
the row. Since the table on which the profile is based is usually the Dayhoff evolutionary distance
table, the consensus residue is the residue that has the smallest evolutionary distance from all of
the residues in that position of the alignment rather than simply the most frequent residue at that
position.
Looking for Structural Motifs with Profiles
Gribskov, et al. (CABIOS 4; 61-66 (1988)) have aligned the sequences from a number of
known protein structural motifs and calculated a group of profiles from these alignments.
ProfileScan compares any new protein sequence to each of the profiles in this motif database to
find out if any of these known motifs occur in the protein. This is one of the few techniques that
can reliably predict the location of structural features in protein sequences.
Database Searching with Profiles
A search of the database using a profile as a probe involves making an optimal alignment
of every sequence in the database to the profile and listing the alignments for which the
alignment score is outstanding.
The profile method has several advantages over most sequence comparison methods. A
profile represents the common characteristics of a family of similar sequences where any single
sequence is just one realization of the family's characteristics. Since the profile represents the
alignment of a number of known sequences, it contains information that defines where the family
of sequences is conserved and where it is variable. The comparison of a new sequence to a
profile can emphasize similarity to conserved regions while tolerating diversity in variable
regions. A database search can be more sensitive since each sequence in the database is
compared to more generalized information than is possible in searches based on pairwise
comparisons between two sequences.
Conventional database searching methods require some minimal level of sequence identity
between the sequences for any signal to be generated. The profile search, since it is based on
quantitative symbol comparisons, can find similarities between sequences with little or no
sequence identity.
The alignment of a sequence to a profile is inherently more sensitive since the whole
surface of comparison can be used to find the optimal alignment. Conventional methods of
searching like the Wilbur and Lipman method use scores that come from one or a small number
of adjacent diagonals. The aligned sequences of many protein families suggest that gaps are
frequent even in very similar proteins.
Experiments Confirm the Sensitivity of Profile Searching
Experiments reported by Gribskov et al. (Proc. Natl. Acad. Sci. USA 84; 4355-4358
(1987)) show that searching the database with a globin profile creates a distribution of alignment
scores that more clearly distinguishes known globins from unrelated sequences. Even globins
distantly related to the group used to make the profile were clearly distinguished from non-globin
sequences. The non-random part of the distribution of the alignment scores also contained a large
number of credibly "globin-like" sequences that were not identified when conventional database
searching algorithms were used.
For comparison, the authors searched the PIR protein sequence database with the
Lipman-Pearson FASTP program (almost identical to FastA) using human alpha hemoglobin as a probe.
The FASTP program selected 244 of the 271 globins in the database. The leghemoglobins could
not be clearly distinguished from non-globin sequences.
Steps in Profile Searching
Profile searching has four steps: assembly of a family of related sequences into a multiple
sequence alignment with PileUp, construction of a profile from the alignment with the program
ProfileMake, comparison of the profile to a database of sequences with ProfileSearch, and finally
display of the best similarities found with ProfileSegments. The starting point for the creation of
a profile is a sequence or group of aligned sequences. This probe is generally a group of
functionally related proteins that have been aligned with tools such as PileUp. A profile,
however, can be created from a single sequence.
The profile is then calculated from the multiple sequence alignment with the program
ProfileMake. The profile contains position-specific gap coefficients based on the position and
length of the gaps in the aligned sequences. The gap and gap length penalty coefficients are
higher in regions in which no gaps are observed in the aligned sequences, and lower where gaps
are observed. When a sequence is aligned to a profile, gaps therefore tend to be placed in the
same regions where they occur in the aligned sequences used to generate the profile.
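The relationship between observed gaps and position-specific gap coefficients can be sketched as follows. This is an illustration in Python, not the formula ProfileMake uses: it considers only how many sequences have a gap in a column and ignores gap length. The function name gap_coefficients and the low and high bounds are assumptions.

```python
def gap_coefficients(alignment, gap_chars=".-~", low=0.1, high=1.0):
    """Return one gap-penalty multiplier per alignment column.

    Gap-free columns get `high`; columns where every sequence has a gap
    get `low`; other columns are interpolated linearly. The bounds and the
    linear rule are illustrative assumptions, not ProfileMake's method.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    coeffs = []
    for col in range(length):
        gaps = sum(1 for row in alignment if row[col] in gap_chars)
        fraction = gaps / n_seqs
        # More observed gaps in a column means a cheaper gap at that position.
        coeffs.append(high - (high - low) * fraction)
    return coeffs
```

Multiplying a base gap penalty by these coefficients makes gaps expensive in columns where the family shows none and cheap where gaps were observed.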
Profiles, once generated, are provided as the input to ProfileSearch along with a sequence
specification like SwissProt:* (the search set). ProfileSearch aligns each sequence in the search
set to the profile and makes a list of the sequences with the best alignment scores.
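The search step can be sketched in Python as a local alignment of each sequence against the profile, followed by ranking. This is a simplified illustration and not the ProfileSearch implementation: it uses linear gap penalties, and the names align_score and profile_search, the default gap_penalty, and the mismatch score of -1.0 are all assumptions.

```python
def align_score(profile, gap_coeffs, seq, gap_penalty=4.0):
    """Best local alignment score of `seq` against `profile`.

    `profile` is a list of {residue: score} dicts, one per position;
    `gap_coeffs` scales the gap penalty at each profile position.
    """
    cols = len(seq)
    prev = [0.0] * (cols + 1)
    best = 0.0
    for i in range(1, len(profile) + 1):
        cur = [0.0] * (cols + 1)
        gap = gap_penalty * gap_coeffs[i - 1]
        for j in range(1, cols + 1):
            # Residues absent from the profile column score -1.0 (assumed).
            match = prev[j - 1] + profile[i - 1].get(seq[j - 1], -1.0)
            skip_profile = prev[j] - gap   # profile position left unmatched
            skip_seq = cur[j - 1] - gap    # sequence residue left unmatched
            cur[j] = max(0.0, match, skip_profile, skip_seq)
            best = max(best, cur[j])
        prev = cur
    return best


def profile_search(profile, gap_coeffs, search_set, top=10):
    """Score every sequence in `search_set` (name -> sequence); return the best."""
    scored = [(align_score(profile, gap_coeffs, s), name) for name, s in search_set.items()]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(name, score) for score, name in scored[:top]]
```

For example, profile_search([{"A": 2.0}, {"C": 2.0}, {"G": 2.0}], [1.0, 1.0, 1.0], {"hit": "TTACGTT", "miss": "TTTTTTT"}) ranks "hit" first because it contains the pattern the toy profile rewards.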
The list is a file of sequence names suitable for input to ProfileSegments, which makes
and displays an optimal alignment of each sequence in the list to the profile consensus sequence.
When you have identified a new sequence that belongs to the sequence family from which your
profile was calculated, you can align it to the whole multiple sequence family with ProfileGap.
A sequence may be compared to a library of defined profiles, representing known sequence and
structural features, with ProfileScan.
References
1. Gribskov, M., McLachlan, A. D., and Eisenberg, D. (1987). Profile Analysis: Detection
of Distantly Related Proteins. Proceedings of the National Academy of Sciences USA 84;
4355-4358.
2. Gribskov, M., Homyak, M., Edenfield, J., and Eisenberg, D. (1988). Profile Scanning for
Three-Dimensional Structural Patterns in Protein Sequences. Computer Applications in the
Biosciences 4; 61-66.
3. Gribskov, M. and Eisenberg, D. (1989). Detection of Protein Structural Features With
Profile Analysis. In Techniques in Protein Chemistry (pp. 108-117), Academic Press, San Diego,
California, USA.
4. Gribskov, M., Luethy, R., and Eisenberg, D. (1989). Profile Analysis. In Methods in
Enzymology 183 (pp. 146-159), Academic Press, San Diego, California, USA.
ProfileGap requires a profile as one of its input files. You can create profiles from aligned
sequences by means of the ProfileMake program. In the ProfileDir directory, GCG provides a
large number of amino acid profiles derived from the PROSITE database.
ProfileGap accepts as its other input one or more sequences of the same type as the sequences
used to create the profile. You can specify multiple sequences by using a sequence specification
with an asterisk (*) wildcard, for example GenEMBL:*. The function of ProfileGap depends on
whether your input sequence(s) are protein or nucleotide. Programs determine the type of a
sequence by the presence of either Type: N or Type: P on the last line of the text heading just
above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for
information on how to change or set the type of a sequence.
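The type check described above can be sketched in Python. The sketch assumes that the last heading line is the first line containing two periods (..), which is the usual GCG divider between heading and sequence; the function name gcg_sequence_type and the sample file content are hypothetical.

```python
import re


def gcg_sequence_type(text):
    """Return "nucleotide", "protein", or None for a GCG-style sequence file.

    Assumes the heading ends at the first line containing ".." and that the
    type appears on that line as "Type: N" or "Type: P".
    """
    for line in text.splitlines():
        if ".." in line:
            match = re.search(r"Type:\s*([NP])\b", line)
            if match is None:
                return None  # divider line found, but no type recorded
            return "nucleotide" if match.group(1) == "N" else "protein"
    return None  # no heading divider found


# Hypothetical file content, for illustration only.
sample = (
    "Example comment line\n"
    "\n"
    "  demo.seq  Length: 8  Type: N  Check: 1234  ..\n"
    "\n"
    "       1  ACGTACGT\n"
)
print(gcg_sequence_type(sample))
```

A result of None suggests the file has no type set, which is the situation Appendix VI addresses.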
For this exercise we will use the nucleic acid “.msf” file that you created with PileUp and
the profile created from this alignment using ProfileMake. Select the “.msf” file and move
your cursor to Functions and select ProfileGAP from the Multiple Comparison Menu (see
next page).
The main window for ProfileGAP will appear. Click on the Profile button. A Choose File
menu will appear. Select the profile that you created from ProfileMake (there should be
only 1 choice).
Next, click on the Options button. Select “globally align…”, “don’t penalize gaps...”, and
“Set thresholds.....”. In the Set thresholds box type “|” and then close the Options Menu.
Select RUN in the main ProfileGAP window.
To view a sample ProfileGap output file, click here. (link to profilegap_output.txt)
OVERLAP and NoOVERLAP
Overlap compares two sets of DNA sequences to each other in both orientations using a
WordSearch style comparison.
Overlap accepts two sets of sequences as input and uses the algorithm of Wilbur and Lipman
(Proc. Natl. Acad. Sci. USA 80; 726-730 (1983)) to compare each sequence of the first set with
each sequence of the second set, in both orientations. In effect, Overlap runs a WordSearch
repeatedly, using each sequence of the first set as a query. Unlike WordSearch, Overlap looks for
overlaps between sequences rather than simply regions of similarity. An overlap is a highly
similar region between two sequences that runs the entire length of a register of comparison.
Overlap lists the position, length, and stringency of discovered overlaps in an output file.
Overlap accepts two separate groups of multiple (one or more) nucleotide sequences as input.
You can specify multiple sequences in a number of ways: by using an MSF or RSF file, for
example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for
example GenEMBL:*.
Overlap identifies sequence similarities using a Wilbur and Lipman-style word comparison (see
the WordSearch entry in the Program Manual for information regarding the details of this
algorithm and considerations about using this search). Overlap differs from WordSearch in that it
accepts a set of query sequences as input and reports overlaps rather than regions of similarity.
Overlap removes gap characters (. and ~) from the input sequences before comparing them.
The output file lists the length, position, and percent similarity (ratio) of each overlap in
descending order of sequence and overlap length. It also gives the orientation of each sequence.
To view a sample Overlap output file, click here. (link to overlap_output.txt)
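The notion of an overlap as a highly similar region that runs the entire length of a register can be sketched in Python. This is a simplified illustration, not the Overlap program: shared words only nominate diagonals (registers), and each nominated register is then compared along its full length. The function names and the default word size, minimum length, and minimum ratio are assumptions.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def word_diagonals(a, b, word=6):
    """Diagonals (offset of a relative to b) that share at least one word."""
    index = defaultdict(list)
    for i in range(len(a) - word + 1):
        index[a[i:i + word]].append(i)
    diagonals = set()
    for j in range(len(b) - word + 1):
        for i in index.get(b[j:j + word], ()):
            diagonals.add(i - j)
    return diagonals


def register_overlap(a, b, diag):
    """Compare a and b along the whole register given by `diag`."""
    start_a, start_b = max(diag, 0), max(-diag, 0)
    length = min(len(a) - start_a, len(b) - start_b)
    matches = sum(1 for k in range(length) if a[start_a + k] == b[start_b + k])
    return start_a, start_b, length, matches / length


def find_overlaps(a, b, word=6, min_length=10, min_ratio=0.9):
    """Return (strand, start in a, start in b, length, ratio), longest first.

    Starts are 1-based; for strand "-" the start in b refers to the
    reverse complement of b.
    """
    hits = []
    for strand, target in (("+", b), ("-", revcomp(b))):
        for diag in word_diagonals(a, target, word):
            sa, sb, length, ratio = register_overlap(a, target, diag)
            if length >= min_length and ratio >= min_ratio:
                hits.append((strand, sa + 1, sb + 1, length, ratio))
    hits.sort(key=lambda h: -h[3])
    return hits
```

With a = "GGGGGACGTACGTTAGC" and b = "ACGTACGTTAGCTTTTT", the end of a matches the start of b, so find_overlaps(a, b) reports one forward-strand overlap of 12 bases starting at position 6 of a.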
NoOverlap identifies the places where a group of nucleotide sequences do not share any
common subsequences.
This program determines if there are regions where a group of nucleotide sequences do not share
any common subsequences. Witkiewicz, Bolander, and Edwards assert that hybridization probes
specific enough to detect individual members of a gene family can be prepared if a region 100
bases or longer can be found that does not have a perfect match of nine or more bases with any
other member of the family (BioTechniques 14(3); 458-463). NoOverlap is designed to find out
if such regions occur in a group of sequences.
To use NoOverlap, you name a group of related sequences in which you want to find regions that
do not share any 9-mer with any other sequence in the group. The resulting output is a list of the
sequences that have such regions and the coordinates of the regions where no common 9-mers
occur.
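The search for regions that share no 9-mer with any other sequence can be sketched in Python as follows. This is an illustration of the idea and not the NoOverlap program: it compares sequences in the forward orientation only, and the function name unique_regions is hypothetical. The defaults follow the 9-base word and 100-base region mentioned above.

```python
def unique_regions(seqs, word=9, min_length=100):
    """Map each sequence name to regions sharing no `word`-mer with the others.

    `seqs` maps names to sequences. Regions are (start, end) pairs, 1-based
    and inclusive. Only sequences with at least one qualifying region appear.
    """
    results = {}
    for name, seq in seqs.items():
        # Collect every word that occurs in any other sequence of the group.
        others = set()
        for other_name, other in seqs.items():
            if other_name != name:
                others.update(other[i:i + word] for i in range(len(other) - word + 1))
        regions = []
        run_start = None
        n_words = len(seq) - word + 1
        for i in range(n_words + 1):
            # Treat the position past the last word as shared to close a run.
            shared = i == n_words or seq[i:i + word] in others
            if not shared and run_start is None:
                run_start = i
            elif shared and run_start is not None:
                start, end = run_start + 1, i + word - 1
                if end - start + 1 >= min_length:
                    regions.append((start, end))
                run_start = None
        if regions:
            results[name] = regions
    return results
```

A small check with a shorter word makes the behavior easy to verify by hand: unique_regions({"a": "ACGTGAGAGACC", "b": "ACGTCCCCCACC"}, word=3, min_length=5) reports bases 3 to 11 in both sequences, because only the first two and the last 3-mers are shared.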
NoOverlap accepts multiple (two or more) nucleotide sequences as input. You can specify
multiple sequences in a number of ways: by using an MSF or RSF file, for example
project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example
GenEMBL:*.
NoOverlap makes an output file with a list of all the non-overlapping regions in every sequence
that meet your requirements for word size and length. To view an example NoOverlap output
file, click here. (link to nooverlap_output.txt)