Download PROBE: A computer program to scan DNA sequence databases for

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Deoxyribozyme wikipedia , lookup

Transcript
292s
Biochemical SocietyTransactions ( 1 992) 20
PROBE A computer program to scan DNA sequence databases
for the existence of potential probe sequences in DNA
Table 1 A sample output from the program PROBE.
Probe sequence : TCCTGTGGATGGGGATCGGCAGTGCCTGCA
: j:\pirform\pri.seq
File scanned
Maximum permitted mismatch : 4
IAN G. GILES
Department of Biochemistry, University of Southampton,
Southampton, SO9 3TU, U.K.
Chemically synthesised oligonucleotides, designed from a
knowledge of either DNA or protein sequence, are being used
more frequently as probes directed towards the genome, for in situ
hybridisation or as primers for PCR. There are computer
programs available to assist in the selection of the most suitable
region to use [1-4] but there is a lack of specifically designed
software that can look through a DNA database to check whether
there are other sequences that might also hybridize.
There are weaknesses in the normal programs used to scan
through databases. For example the widely used FASTA program
[51 will be able to locate regions that are not perfect matches, but
it scans only the DNA strand stored in the database (in general
that corresponding to the mRNA strand). Ideally a program to
scan a potential probe sequence should look at both strands.
Furthermore FASTA is unable to use the IUB redundancy codes
for nucleic acids, These are needed in the early stage of probe
design when starting from protein sequence information. The
program EMBLScan [6] is a very good program to search a DNA
database for very similar sequences and is able to scan for exact
matches in either a single or double strand of DNA. However, it
is not suited for locating areas containing mismatches.
Consequently a program able to scan rapidly through a DNA
database permitting a given number of mismatches to accumulate
was written. In addition the program was designed to be able to
scan both strands of the DNA in the database on the fly by
generating the reverse complement of the input probe.
The program automatically detects the format of the sequence
file containing the probe sequence using the GETSEQ routine
previously described [7]. The version of GETSEQ used has been
extended to support EMBL, GENBANK, Staden (DButil), FASTA,
CODATA and PIRFORM formatted files. Any file not recognised
is read in verbatim. This routine is also used to input from the
DNA database and has no difficulties accessing the files on the
EMBL CD-ROM. When the CDROM is used it is preferable to
use the NBRF format database files in the PIRFORM subdirectory
as the program executes more rapidly. The program does not
have an upper limit on the size of a sequence in the database, as
large entries are read in segments.
The program executes quite rapidly as a result of the algorithm
used. Searching for any probe length fragment from the data-base
entry is only continued whilst less than the maximum number of
mismatches has been encountered. As soon as the permitted
number of mismatches has been exceeded the scan using that
particular subsequence is aborted. Consequently the program
run time is not linearly dependent on the length of the probe
sequence. To speed the program further each base is bitwise
encoded. This means that a search using a redundant probe takes
no longer than looking for identity to a non-redundant probe.
The efficiency of the program can be illustrated using a 30
nucleotide probe which was scanned through the primate division
of release 29 of the EMBL database (13,658,513 bases in 11,424
entries) on CD-ROM. The times taken for the programs
EMBLScan, FASTA and PROBE on a 20MHz 386SX PC were lmin,
lhr 30min and 30min respectively. In this comparison we can see
that EMBLScan is very fast, but as already mentioned is unable to
deal with mismatches. PROBE runs faster than FASTA in this
context and it scans both DNA strands in a single pass.
PROBE runs on an IBM-compatible PC and was written in ANSI
standard ' C using the Turbo C compiler. The code should port
to other platforms with no difficulty. The program can either be
used interactively, when the program prompts the user for the
input data required, or it can accept this information from the
command line. This latter feature means that it is easy to set up
batch files enabling a number of scans to be performed without
>HSAFX?.
318
>
- Human aromatase m A , complete cds
TCCTGTGGATGGGGATCGGCAGTGCCTGCA
*********r*ttt*+tttt**.****,**
total matches: 1
>HSAROMAT
- Human M A for aromatase
synthetase)
3 1 8 > TCCTGTGGATGGGGATCGGCAGTGCCTGCA
(estrogen
...............................
total matches: 1
>HSAFlE'450
- H u m n mRNA for aromatase P-450
2 4 3 > TCCTGTGGATGGGGATCGGCAGTGCCTGCA
..............................
total matches: 1
>HSCY4ARO
- Human aromatase system cytochrome P-450
(P45OXIX) m A , complete
2 2 1 > TCCTGTGGATGGGGATCGGCAGTGCCTGCA
*t**t*t***t*t*t*****tt*********
total matches: 1
>HSCYAR03
Human aromatase cytochrome P-450 gene, exon
-
3
194
> TCCTGTGGATGGGGATCGGCAGTGCCTGCA
..............................
total matches: 1
The database contained 1 3 6 5 8 5 1 3 residues in 1 1 4 2 4 entries.
Overall a total of 5 matches in 5 sequences were found
direct user intervention.
This program is available from the author at the above address,
or via e-mail (address [email protected]).
I thank The Wessex Medical Trust for financial support.
1.
2.
3.
4.
5.
6.
7.
Rychlik, W. & Rhoads, R.E. (1989) Nucleic Acids Res. 16,
8543-8551
Lowe,T., Sharefkind., Yang, S.Q. & Dieffenbach, C.W.
(1990) Nucleic Acids Res. 18, 1757-1761
Lucas, K., Busch, M., Mossinger, S. & Thompson, ].A.
(1991) Comput. Applic. Biol. Sci. 7, 525-529
Bloomfield, M. & Giles, I.G. (1992)Biochem. SOC.Trans in
press
Pearson, W.R. & Lipman, D.J. (1988) Proc. Natl. Acad. Sci.
USA 85,2444-2448
Higgins, D.G. & Stoehr, P.J. (1991)EMBLScan User Guide,
EMBL Data Library, Heidelberg, Germany.
Cockwell, K.Y. & Giles, I.G. (1989) Comput. Applic. Biol.
Sci. 5, 227-232