Download dna sequence analysis: pairwise comparison

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Non-coding DNA wikipedia , lookup

Matrix-assisted laser desorption/ionization wikipedia , lookup

Molecular ecology wikipedia , lookup

Proteolysis wikipedia , lookup

Biochemistry wikipedia , lookup

Biosynthesis wikipedia , lookup

Multilocus sequence typing wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Protein structure prediction wikipedia , lookup

Point mutation wikipedia , lookup

Genetic code wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Transcript
DNA/PROTEIN SEQUENCE
ANALYSIS: PAIRWISE COMPARISON
Sequence comparison techniques fall into two categories: pairwise or multiple. Pairwise
comparison provides computer tools to directly compare two sequences, either nucleic acid or
peptide. These tools are very useful when verifying sequence data, cloning projects, PCR
analysis, and many more. Anytime you would like to compare two sequences, these are the tools
to use. The programs are similar with respect to input but vary greatly regarding the output
formats. This tutorial will cover the following programs: GAP, BESTFIT, FRAMEALIGN,
COMPARE, DOTPLOT, and GAPSHOW.
The sequences you will use are as follows:
M19993
M15034
P08689
gb_ro - rodent database within GenBank
gb_ro - rodent database within GenBank
UniProt database
GAP
When you want an alignment that covers the whole length of both sequences, use GAP.
Gap uses the algorithm of Needleman and Wunsch to find the alignment of two complete
sequences that maximizes the number of matches and minimizes the number of gaps. Gap reads a
scoring matrix that contains values for every possible GCG symbol match. Gap finds an
alignment with the maximum possible quality where the quality of an alignment is equal to the
sum of the values of the matches (each match scored with the scoring matrix) less the gap
creation penalty times the number of internal gaps and less the gap extension penalty times the
total length of the internal gaps. The alignment found by Gap is, therefore, sensitive to the
scoring matrix values and the gap penalties. There is no penalty if either sequence is shifted to
the place where the alignment begins unless end gaps are penalized by using the command-line
parameter -ENDWeight.
Gap considers all possible alignments and gap positions and creates the alignment with the
largest number of matched bases and the fewest gaps. You provide a gap creation penalty and a
gap extension penalty in units of matched bases. In other words, Gap must make a profit of gap
creation penalty number of matches for each gap it inserts. If you choose a gap extension penalty
greater than zero, Gap must, in addition, make a profit for each gap inserted of the length of the
gap times the gap extension penalty. Gap uses the alignment method of Needleman and Wunsch
(J. Mol. Biol. 48; 443-453 (1970)) that has been shown to be equivalent to Sellers.
1
Gap accepts two individual nucleotide sequences or protein sequences as input. The function of
Gap depends on whether your input sequence(s) are protein or nucleotide. Programs determine
the type of a sequence by the presence of either Type: N or Type: P on the last line of the text
heading just above the sequence. Input sequences may not be more than 30,000 symbols long.
The program attempts to allocate enough computer memory to align the input sequences. In the
worst case, where the two sequences being aligned are unrelated, the allocation is proportional to
the product of the lengths of the two input sequences. However, in many cases where the
sequences being aligned are more closely related, the computer can determine an optimal
alignment using less memory. When memory on your computer is limiting and the program
cannot allocate all of the memory it needs to align long sequences, it completes the alignment in
whatever memory it can allocate and displays the message *** Alignment is not guaranteed to be
optimal ***. Because the criteria used in the calculation for guaranteeing an optimal alignment
are very stringent, the alignment often may be optimal even if this message is displayed.
For the following exercise your SeqLab screen should look like this. We suggest opening a new
list. You can then delete the entire list from directory or save it as a HELP tool when performing
additional comparisons.
2
To start the GAP program, highlight the two nucleotide sequences, move to functions and
select GAP from the Pairwise Comparison Menu:
The following window will appear. The top lines contain the sequences that are being compared.
3
Click the Options… button. For this exercise, select “Set thresholds...” and enter “4” in the
window, de-select “Abbreviate large gaps...”, and select both “New sequence file for....”
options. Close the Options window and select RUN.
4
For most sequences, the default settings will provide a reasonable alignment. Gap is capable of
ignoring a region of excellent similarity or similarity between two sequences if it can produce an
alignment with equal or better quality in some other way. BestFit is a better tool to search for
weak or unknown similarity or similarity that you suspect is not coextensive along the sequences.
5
It is extremely important that you think formally about what Gap does. Using Gap rather than
BestFit implies that you want an alignment where neither sequence is truncated.
To see a sample output file, click here. (link to gap_output.txt)
GAPSHOW
GapShow displays an alignment by making a graph that shows the distribution of similarities and
gaps. The two input sequences should be aligned with either Gap or BestFit before they are given
to GapShow for display.
BestFit and Gap make optimal alignments of sequences by adding gaps to maximize the number
of matches. Gap and BestFit normally display the alignments, but they can also write the aligned
sequences (with gaps inserted) into new sequence files. GapShow reads these files and plots the
distribution of the differences or similarities in the alignment graphically. The sequences are
represented by horizontal lines. The horizontal lines have openings at points where there are gaps
in either sequence. Regions of interest, such as coding regions, can be shown outside these lines.
A large vertical line between the sequences can indicate either a difference or similarity. If
differences are being shown, gaps are also depicted with short vertical lines to convey that the
gap is a difference between the two sequences.
GapShow accepts two aligned individual nucleotide sequences or protein sequences as input.
The function of GapShow depends on whether your input sequence(s) are protein or nucleotide.
Programs determine the type of a sequence by the presence of either Type: N or Type: P on the
last line of the text heading just above the sequence.
Enter the two “.gap” sequences from the Output Manager into your Main List. Select both
“.gap” sequences in the Main List, move to Functions menu and from the Pairwise
Comparison menu select GAPSHOW.
6
The following window will appear. Make sure that your window contains sequences that end
with “.gap”.
7
Click on the Options button. When the Options window appears, select “plot for
similarities” and “represent the sequences....”. Close the Options window and click on Run
in the GAPSHOW window.
The output will be loaded into your Output Manager. Open your Output Manager, select the
“.figure” file and Display the output.
The following graph will be displayed. The thicker bars represent the similarities while the
thinner line represents the gaps.
8
The BESTFIT program also will generate “.gap” sequences that can be used to generate a
GAPSHOW plot. To select this option, in the BestFit option window, select the “New
sequence....” toggles and the sequences will be loaded into your Output Manager.
To print the alignment, go to output manager and select the “.pair” output file and click on
print. Follow the printing instructions found in the the printing FAQ.
To better understand the statistics of the alignment and gain a better understanding of the results
obtained for your pairwise comparison, refer to the paper by Needleman and Wunsch (J. Mol.
Biol. 48: 443-453 (1970)).
BESTFIT
BestFit is the most powerful method in the Wisconsin Package(TM) for identifying the best
region of similarity between two sequences whose relationship is unknown.
BestFit makes an optimal alignment of the best segment of similarity between two sequences.
Optimal alignments are found by inserting gaps to maximize the number of matches using the
local homology algorithm of Smith and Waterman.
BestFit inserts gaps to obtain the optimal alignment of the best region of similarity between two
sequences, and then displays the alignment in a format similar to the output from Gap. The
9
sequences can be of very different lengths and have only a small segment of similarity between
them. You could take a short RNA sequence, for example, and run it against a whole
mitochondrial genome.
BestFit accepts two individual nucleotide sequences or protein sequences as input. The function
of BestFit depends on whether your input sequence(s) are protein or nucleotide. Programs
determine the type of a sequence by the presence of either Type: N or Type: P on the last line of
the text heading just above the sequence. If your input sequences are peptide sequences, this
program uses a scoring matrix, blosum62.cmp, with comparison values derived from a study of
substitutions between amino acid pairs in ungapped block of aligned protein segments as
measured by Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)).
Algorithm:
BestFit uses the local homology algorithm of Smith and Waterman (Advances in Applied
Mathematics 2; 482-489 (1981)) to find the best segment of similarity between two sequences.
BestFit reads a scoring matrix that contains values for every possible GCG symbol match. The
program uses these values to construct a path matrix that represents the entire surface of
comparison with a score at every position for the best possible alignment to that point. The
quality score for the best alignment to any point is equal to the sum of the scoring matrix values
of the matches in that alignment, less the gap creation penalty times the number of gaps in that
alignment, less the gap extension penalty times the total length of all gaps in that alignment. The
gap creation and gap extension penalties are set by you. If the best path to any point has a
negative value, a zero is put in that position.
After the path matrix is complete, the highest value on the surface of comparison represents the
end of the best region of similarity between the sequences. The best path from this highest value
backwards to the point where the values revert to zero is the alignment shown by BestFit. This
alignment is the best segment of similarity between the two sequences.
For nucleic acids, the default scoring matrix has a match value of 10 for each identical symbol
comparison and -9 for each non-identical comparison (not considering nucleotide ambiguity
symbols for this example). The quality score for a nucleic acid alignment can, therefore, be
determined using the following equation:
Quality = 10 x TotalMatches + -9 x TotalMismatches - (GapCreationPenalty x GapNumber) (GapExtensionPenalty x TotalLengthOfGaps)
The quality score for a protein alignment is calculated in a similar manner. However, while the
default nucleic acid scoring matrix has a single value for all non-identical comparisons, the
default protein scoring matrix has different values for the various non-identical amino acid
comparisons. The quality score for a protein alignment can therefore be determined using the
following equation (where Total(AA) is the total number of A-A (Ala-Ala) matches in the
10
alignment, CmpVal(AA) is the value for an A-A comparison in the scoring matrix, Total(AB) is
the total number of A-B (Ala-Asx) matches in the alignment, CmpVal(AB) is the value for an AB comparison in the scoring matrix, ...) :
Quality = CmpVal(AA) x Total(AA) + CmpVal(AB) x Total(AB) + CmpVal(AC) x Total(AC)
+ CmpVal(ZZ) x Total(ZZ) - (GapCreationPenalty x GapNumber) - (GapExtensionPenalty x
TotalLengthOfGaps)
BestFit Always Finds Something
BestFit always finds an alignment for any two sequences you compare -- even if there is
no significant similarity between them! You must evaluate the results critically to decide
if the segment shown is not just a random region of relative similarity.
*****
The Segments Shown Obscure Alternative Segments
BestFit only shows one segment of similarity, so if there are several, all but one are
obscured. You can approach this problem with graphic matrix analysis (see the Compare
and DotPlot programs). Alternatively, you can run BestFit on ranges outside the ranges of
similarity found in earlier runs to bring other segments out of the shadow of the best
segment.
The Best Fit is Only One Member of a Family
Like all fast gapping algorithms, the alignment displayed is a member of the family of
best alignments. This family may have other members of equal quality, but will not have
any member with a higher quality. The family is usually significantly different for
different choices of gap creation and gap extension penalties. See the
CONSIDERATIONS topic in the entry for the Gap program in the Program Manual to
learn more about how to assign gap creation and gap extension penalties.
Default Gap Penalties are Specific to Each Scoring Matrix
BestFit chooses default gap creation and extension penalties that are appropriate for the
scoring matrix it reads. If you select a different scoring matrix with the -MATRix
command-line parameter, the program will adjust the default gap penalties accordingly.
(See Appendix VII for information about how to set the default gap penalties for any
scoring matrix.) You can use -GAPweight and -LENgthweight to specify alternative gap
penalties if you don't want to accept the default values.
11
The Public Scoring Matrix for Nucleic Acid Comparisons is Very Stringent
The scoring matrix swgapdna.cmp awards matches +10 and penalizes mismatches -9 so
the segments found may be very brief. This penalty means that the alignment cannot be
extended by even two bases to pick one extra match. The scoring matrix used by Smith
and Waterman, when local alignments were first described, used -.33 for the mismatch
penalty and +1 for the match score. You can use Fetch to copy randomdna.cmp and
rename it swgapdna.cmp to use a value of -3 for each mismatch and +10 for each match,
or use nwsgapdna.cmp, which has no mismatch penalty at all.
Rapid Alignment
When possible, BestFit tries to find the optimal alignment very quickly. If this rapid
alignment is not unambiguously optimal, BestFit automatically realigns the sequences to
calculate the optimal alignment. When this occurs, the monitor of alignment progress on
your terminal screen (Aligning...) is displayed twice for a single alignment.
To run BESTFIT, select the gb_ro:m15034 and the gb_ro:m19993 sequences from the
Main list window, move your cursor to Functions and select BestFit from the Pairwise
Comparison Menu:
The BestFit window will appear. Click on the Options button. The Option Window is very
similar to the GAP Option window. From this window, select “Set thresholds....” and type
12
“4” into the highlighted window, de-select “Abbreviate output.....”, and select both “New
sequence...”. Change the name of the output file names by adding a “-1” behind the
accession number (ex. M19993-1.gap). This will distinguish the GAP and BestFit result
sequences.
13
The results obtained from the BestFit program provides the user with the BEST alignment found.
The best alignment may not contain the entire sequence; in fact it usually does not. One
difference is that the ends of the genomic sequence have been left off the alignment. Only GAP
will provide full length sequences in the output alignment.
Once again, for most alignments the default options will provide a reasonable alignment.
To view a sample output file, click here. (link to bestfit_output.txt)
The .gap files can be used by the GAPSHOW program to provide a graphical representation of
the BestFit results. Follow the instructions found in the GAPSHOW section to produce the
following figure:
14
FRAMEALIGN
To compare a nucleic acid sequence to a peptide sequence use FrameAlign. This is a very
useful tool for verifying coding regions during cloning or sequencing projects.
FrameAlign creates an optimal alignment of the best segment of similarity (local alignment)
between a protein sequence and the codons in all possible reading frames of a nucleotide
sequence. Optimal alignments may include reading frame shifts.
FrameAlign inserts gaps to obtain the optimal local alignment of the best region of similarity
between a protein sequence and the codons in a nucleotide sequence. Because FrameAlign can
align the protein to codons in different reading frames of the nucleotide sequence, it can identify
sequence similarity even when the nucleotide sequence contains reading frame shifts.
In standard sequence alignment programs, you routinely specify gap creation and
extension penalties. In addition to these penalties, FrameAlign also allows you to specify a
separate frameshift penalty for the creation of gaps that result in reading frame shifts in the
nucleotide sequence. (See the ALGORITHM topic for a more detailed explanation of how gaps
are penalized).
15
ALGORITHM:
FrameAlign aligns a nucleotide sequence with a protein sequence. The alignment procedure is an
extension of the local alignment algorithm of Smith and Waterman (Advances in Applied
Mathematics 2; 482-489 (1981)) that is modified to determine the score of the best segment of
similarity between a protein sequence and the codons in a nucleotide sequence.
Scoring Matrix
To create the alignments, FrameAlign requires a scoring matrix that contains values for
matches between all possible amino acids and codons. FrameAlign derives this amino
acid - codon scoring matrix on the fly from a translation table and an amino acid
substitution matrix. The translation table contains a list of all possible codons for each
amino acid. The amino acid substitution matrix contains match values for the comparison
of all possible amino acids.
In the derived amino acid - codon scoring matrix, the value of a match between any
amino acid and any codon is the value of the match between the amino acid and the
translated codon in the amino acid substitution matrix. If a codon contains IUB nucleotide
ambiguity symbols, and all possible unambiguous representations of the codon translate
to the same amino acid (e.g. MGR always translates to arginine in the standard genetic
code), then the value of a match between that codon and any amino acid can be similarly
determined. If all possible unambiguous representations of the codon do not translate to
the same amino acid, then the value of a match between that codon and any amino acid is
0.
FrameAlign chooses default gap creation and extension penalties that are appropriate for
the scoring matrix it reads. If you select a different scoring matrix with the -MATRix
command-line parameter, the program will adjust the default gap penalties accordingly.
You can respond to the program prompts or use –GAPweight and -LENgthweight to
specify alternative gap penalties if you don't want to accept the default values.
Protein-Nucleotide Alignment
FrameAlign uses the values in the amino acid - codon scoring matrix to determine the
score of the best alignment between the protein and nucleotide sequences. If you consider
a graph, or path matrix, with the nucleotide sequence placed on the X axis and the protein
sequence placed on the Y axis, then every point on the path matrix represents the best
alignment between the sequences that ends at that point. For any point on the path matrix,
the X coordinate is the first nucleotide of the final codon in the alignment, and the Y
coordinate is the final amino acid in the alignment. Each possible alignment end point is
associated with a path, which is a series of steps (insertions, deletions, matches) through
the path matrix required to create the alignment. Each step has its own score, and the
scores for all the steps in an alignment path determine the quality score for the alignment.
The quality score for an alignment is equal to the sum of the scoring matrix values of the
16
matches in the alignment, minus the gap creation penalty multiplied by the number of
gaps in the alignment, minus the frameshift penalty multiplied by the number of gaps in
the alignment that change the reading frame, minus the gap extension penalty multiplied
by the total length of all gaps in the alignment. (You can set the value for each of the
penalties.)
Quality = SUM(scoring matrix values of the matches in the alignment) – (gap creation
penalty x number of gaps in the alignment) – (frameshift penalty x number of gaps in
the alignment that change the reading frame) – (gap extension penalty x total length of all
gaps in the alignment)
For example, the following protein-nucleotide alignment consists of six steps:
1 UGUUGUGUUCG....UGGUGG 17
||||||:::
||||||
1 CysCysValArgIleTrpTrp 7
The first two steps are UGU-Cys matches. The third step is an GUU-Val match. The
fourth step is a four nucleotide deletion. The last two steps are UGG-Trp matches. The
quality score for this alignment is the sum of the scoring matrix values for two UGU-Cys
matches, one GUU-Val match, and two UGG-Trp matches, minus one gap creation
penalty, minus four gap extension penalties, minus one frameshift penalty.
Matches between an amino acid and a partial codon, like
CG. for Arg
in the above example, do not add any match value to the alignment score. By convention,
all gap characters in partial codons are placed at the end of the codon. For example, the
partial codon “CG.” in the above example will never be written as “C.G”. If the best
alignment ending at any point has a negative value, a zero is put at that position of the
path matrix; otherwise, the quality score for the alignment is put at that position. After the
path matrix is completely filled, the highest value in the matrix represents the score of the
best region of similarity between the sequences (optimal local alignment). This highest
value is reported as the comparison score between the nucleotide and protein sequences.
The alignment itself can be reconstructed for display by following the best path from this
point of highest value backward to the point where the path matrix has a value of zero.
FrameAlign Always Finds Something
FrameAlign always finds an alignment for any protein and nucleotide sequences you
compare, even if there is no significant similarity between them. You must evaluate the
results critically to decide if the segment shown is not just a random region of relative
similarity.
17
FrameAlign Shows Only a Single Segment of Similarity
-FrameAlign shows only one optimal alignment between a protein sequence and a
nucleotide sequence. There are reasons why you might want to evaluate several optimal
and suboptimal alignments.
- If there are several disjoint segments of similarity, the selection of only a single segment
for display does not provide a comprehensive view of the relationship between the
nucleotide and protein sequences.
- The alignments displayed by FrameAlign are sensitive to your choices for the scoring
matrix and gap penalties. If you vary these choices even slightly, FrameAlign may
calculate different optimal alignments for the same segment of similarity between the
sequences. If FrameAlign were able to display multiple and suboptimal alignments of the
same region, you would be able to use the variation among the different alignments to
determine which portions of the alignments were reliably determined.
FrameAlign, GAP and BestFit are very similar programs and contain many of the same optional
parameters found on the Options page. To begain using FrameAlign, select the
UniProt:p08689 peptide sequence and the gb_ro:m15034 nucleic acid sequence. Move your
cursor to the Functions Menu and select FrameAlign from the Pairwise Comparison Menu.
18
The FrameAlign program window will be loaded.
Click on the Options button and make sure that the sequences we are analyzing are in the
sequence windows. Select “Set thresholds...” and type “4” in the highlighted window.
Also, select “New sequence file...” for each sequence (you should change the m15034.gap to
m15034-2.gap). Close the option window and click on Run in the main FrameAlign
window. The options page will look familiar. The FrameAlign, GAP, and BestFit pages contain
many of the same parameters to set.
These two sequences are the cDNA and protein sequence for the rat androgen-binding protein
(ABP). These sequences produce a perfect alignment. Sequences that you run may not be this
perfect, remember that FrameAlign, like BestFit, displays only the best alignment. The output
sequences can be used to run other GCG programs that are expecting 2 nucleic acid or 2 peptide
sequences (ex. GAPSHOW).
To view a sample output file, click here. (link to framealign_output.txt)
COMPARE
Compare compares two protein or nucleic acid sequences and creates a file of the points of
similarity between them for plotting with DotPlot. Compare finds the points using either a
19
window/stringency or a word match criterion. The word comparison is 1,000 times faster than
the window/stringency comparison, but somewhat less sensitive.
Compare makes a file with the coordinates of each point where two sequences are similar. The
sequences are compared in every possible register and a point is added to the file wherever some
match criterion for similarity is met. The match criterion can be met in two different ways:
1. The standard way compares two sequences in every register, searching for all the
places where a given number of matches (stringency) occur within a given range
(window). See Maizel and Lenk (1981) "Enhanced Graphic Matrix Analysis of
Nucleic Acid and Protein Sequences" Proc. Natl. Acad. Sci. USA 78;
7665-7669 for a description of the matrix analysis of biological sequences.
2. The other way to find points of similarity is to search for short perfect matches of
some set length. Short perfect matches are referred to as words. The word
comparison between two sequences is about 1,000 times faster than the
window/stringency match described above, but it requires that the sequences
contain short perfect matches for any similarity to be found. Word comparison is
discussed in detail by Wilbur and Lipman (1983) "Rapid Similarity Searches of
Nucleic Acid and Protein Data Banks" Proc. Natl. Acad. Sci. USA 80; 726-730.
The authors refer to a word as a k-tuple. Compare does a word comparison if it
is run with the command-line parameter -WORdsize.
Compare accepts two individual nucleotide sequences or protein sequences as input. The
function of Compare depends on whether your input sequence(s) are protein or nucleotide.
Programs determine the type of a sequence by the presence of either Type: N or Type: P on the
last line of the text heading just above the sequence.
No more than 200,000 points may be produced in the plot file. The point files can be quite large
and should be deleted as soon as they have been examined. Window must be between 1 and 100,
or if word comparison is done, the word size must be between 1 and 25.
ALGORITHM:
Compare makes a file of every point where two sequences are similar according to a set match
criterion. The points are the Cartesian coordinates of each point of similarity in units of the
original sequence coordinates. If the window is greater than 1, the point recorded by Compare is
in the middle of the window.
Window/Stringency Comparisons
For window/stringency comparisons, Compare reads a scoring matrix that defines a
match value for every possible GCG symbol comparison. Compare then slides the
vertical sequence along the horizontal in order to generate every possible register of
comparison. For each register, Compare slides a window along the pair of sequences. The
20
match values for each pair of symbols within the window are summed to determine a
score for the window at each window position. When the
score is greater than or equal to the stringency, then the match criterion has been met and
a point is added to the file at the position of the middle of the window on both axes.
When the window has no integral center (windows of even length), then Compare rounds
the coordinates up. If you have used the command-line parameter -ALL, then points are
added to the file at all of the positions within the window that have match values greater
than or equal to the average positive
non-identical comparison value in the scoring matrix (see -ALL in the OPTIONAL
PARAMETERS topic).
Word Comparisons
For word comparisons, you set a word length. Compare then slides the vertical sequence
along the horizontal in order to generate every possible register of comparison. For each
register, Compare slides a window whose size is equal to the word length along the pair
of sequences. If all of the symbols in the two sequences within the window are identical,
Compare puts a point in the file at the middle of the word's position in the two sequences.
If the word has no integral center (words of even length), then Compare rounds the
coordinates up.
Alphabet
The parameter alphabet that appears in the output is the number of symbols in the
alphabet that could make up each word. The alphabet contains four symbols for nucleic
acids and up to 31 for peptide sequences.
Try a Word Comparison First
Word comparisons are very fast, so run Compare with the command-line parameter WORdsize first. Usually, this pilot run gives you a rough idea what the dot-plot for the
more sensitive window/stringency comparison is going to look like. See the two plots in
the Program Manual entry for DotPlot for examples of each type of comparison.
Setting Window and Stringency
A window 21-symbols wide with a stringency of 14 is a good place to start when
comparing nucleic acid sequences that have very few ambiguity codes in them. The
number of points you get should be of the same magnitude as the number of symbols in
your sequences. We have had good results with a window of 30 and a stringency of 11 for
peptide sequence comparison. You can use -LIMit to stop the program before the number
of points gets unreasonable.
Batch Queue
21
Unless you are using the -WORdsize parameter, Compare is one of the few programs in
the Wisconsin Package that can take more than a few minutes to run. Therefore, large
comparisons should probably be run in the batch queue. You can specify that this program
run at a later time in the batch queue by using the command-line parameter -BATch. Run
this way, the program prompts you for all the required parameters and then automatically
submits itself to the batch or at queue. For more information, see "Using the Batch Queue"
in Chapter 3, Using Programs in the User's Guide. Very large comparisons may exceed the
CPU limit set by some systems. In practice you should probably limit the range of the
sequences compared to about 10,000 for each batch job.
Setting Word Size
You might try a word size of 6 for nucleic acid sequences of 1,000 bases and perhaps 8
for 10,000 bases. You can start with a word size of 2 or 3 for peptide-sequence
comparisons.
DotPlot
Dot-plotting is the best method in the Wisconsin Package(TM) for comparing two
sequences when you suspect that there could be more than one segment of similarity
between the two.
DotPlot makes a dot-plot with the output file from Compare, FoldRNA, or StemLoop. In
SeqLab, the program DotPlot can stand alone and will accept Compare files that have been
prepared on the command line. SeqLab integrates the Compare results directly into DotPlot.
Therefore, when you run Compare there are two results: 1. A matrix of all the points of
comparison between the two query sequences and, 2. A plot that displays the results of Compare
and shows regions of extended homology with solid vertical line.
To examine the utility of Compare and DotPlot, we will use the gb_ro:m15034 and the
gb_ro:m19993 sequences. Select both sequences in your Main List, move your cursor to
the Functions Menu and select Compare from the Pairwise Comparison Menu (see next
page).
22
The main Compare window will appear. As always, make sure that the sequences you selected
from the main list are the sequences the program is going to use for its analysis.
23
From the main Compare window, select “window/stringency comparison” and click Run.
As mentioned previously, there are 2 output forms that are loaded into your Output Manager.
The first output is in the format of a “XXXX.pnt” file. An example of this file is presented
below, but without the plot it is difficult to interpret. This is a list of all the points that contain
homology between the 2 query sequences.
To see a sample output file of this type, click here. (link to compare_output.txt)
The second output format is a “figure” file and contains the graphical output that interpreted the
point matrix and created this plot. The solid lines show the regions of homology between the
two sequences. The m19993 sequence is the genomic sequence for the ABP gene while the
m15034 sequence is the cDNA. Therefore, a total of 8 exons (the # of exons in the ABP cDNA)
should be represented in the DotPlot. With this example the program identifies all exons as solid
lines.
24
25