Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
292s Biochemical SocietyTransactions ( 1 992) 20 PROBE A computer program to scan DNA sequence databases for the existence of potential probe sequences in DNA Table 1 A sample output from the program PROBE. Probe sequence : TCCTGTGGATGGGGATCGGCAGTGCCTGCA : j:\pirform\pri.seq File scanned Maximum permitted mismatch : 4 IAN G. GILES Department of Biochemistry, University of Southampton, Southampton, SO9 3TU, U.K. Chemically synthesised oligonucleotides, designed from a knowledge of either DNA or protein sequence, are being used more frequently as probes directed towards the genome, for in situ hybridisation or as primers for PCR. There are computer programs available to assist in the selection of the most suitable region to use [1-4] but there is a lack of specifically designed software that can look through a DNA database to check whether there are other sequences that might also hybridize. There are weaknesses in the normal programs used to scan through databases. For example the widely used FASTA program [51 will be able to locate regions that are not perfect matches, but it scans only the DNA strand stored in the database (in general that corresponding to the mRNA strand). Ideally a program to scan a potential probe sequence should look at both strands. Furthermore FASTA is unable to use the IUB redundancy codes for nucleic acids, These are needed in the early stage of probe design when starting from protein sequence information. The program EMBLScan [6] is a very good program to search a DNA database for very similar sequences and is able to scan for exact matches in either a single or double strand of DNA. However, it is not suited for locating areas containing mismatches. Consequently a program able to scan rapidly through a DNA database permitting a given number of mismatches to accumulate was written. In addition the program was designed to be able to scan both strands of the DNA in the database on the fly by generating the reverse complement of the input probe. The program automatically detects the format of the sequence file containing the probe sequence using the GETSEQ routine previously described [7]. The version of GETSEQ used has been extended to support EMBL, GENBANK, Staden (DButil), FASTA, CODATA and PIRFORM formatted files. Any file not recognised is read in verbatim. This routine is also used to input from the DNA database and has no difficulties accessing the files on the EMBL CD-ROM. When the CDROM is used it is preferable to use the NBRF format database files in the PIRFORM subdirectory as the program executes more rapidly. The program does not have an upper limit on the size of a sequence in the database, as large entries are read in segments. The program executes quite rapidly as a result of the algorithm used. Searching for any probe length fragment from the data-base entry is only continued whilst less than the maximum number of mismatches has been encountered. As soon as the permitted number of mismatches has been exceeded the scan using that particular subsequence is aborted. Consequently the program run time is not linearly dependent on the length of the probe sequence. To speed the program further each base is bitwise encoded. This means that a search using a redundant probe takes no longer than looking for identity to a non-redundant probe. The efficiency of the program can be illustrated using a 30 nucleotide probe which was scanned through the primate division of release 29 of the EMBL database (13,658,513 bases in 11,424 entries) on CD-ROM. The times taken for the programs EMBLScan, FASTA and PROBE on a 20MHz 386SX PC were lmin, lhr 30min and 30min respectively. In this comparison we can see that EMBLScan is very fast, but as already mentioned is unable to deal with mismatches. PROBE runs faster than FASTA in this context and it scans both DNA strands in a single pass. PROBE runs on an IBM-compatible PC and was written in ANSI standard ' C using the Turbo C compiler. The code should port to other platforms with no difficulty. The program can either be used interactively, when the program prompts the user for the input data required, or it can accept this information from the command line. This latter feature means that it is easy to set up batch files enabling a number of scans to be performed without >HSAFX?. 318 > - Human aromatase m A , complete cds TCCTGTGGATGGGGATCGGCAGTGCCTGCA *********r*ttt*+tttt**.****,** total matches: 1 >HSAROMAT - Human M A for aromatase synthetase) 3 1 8 > TCCTGTGGATGGGGATCGGCAGTGCCTGCA (estrogen ............................... total matches: 1 >HSAFlE'450 - H u m n mRNA for aromatase P-450 2 4 3 > TCCTGTGGATGGGGATCGGCAGTGCCTGCA .............................. total matches: 1 >HSCY4ARO - Human aromatase system cytochrome P-450 (P45OXIX) m A , complete 2 2 1 > TCCTGTGGATGGGGATCGGCAGTGCCTGCA *t**t*t***t*t*t*****tt********* total matches: 1 >HSCYAR03 Human aromatase cytochrome P-450 gene, exon - 3 194 > TCCTGTGGATGGGGATCGGCAGTGCCTGCA .............................. total matches: 1 The database contained 1 3 6 5 8 5 1 3 residues in 1 1 4 2 4 entries. Overall a total of 5 matches in 5 sequences were found direct user intervention. This program is available from the author at the above address, or via e-mail (address [email protected]). I thank The Wessex Medical Trust for financial support. 1. 2. 3. 4. 5. 6. 7. Rychlik, W. & Rhoads, R.E. (1989) Nucleic Acids Res. 16, 8543-8551 Lowe,T., Sharefkind., Yang, S.Q. & Dieffenbach, C.W. (1990) Nucleic Acids Res. 18, 1757-1761 Lucas, K., Busch, M., Mossinger, S. & Thompson, ].A. (1991) Comput. Applic. Biol. Sci. 7, 525-529 Bloomfield, M. & Giles, I.G. (1992)Biochem. SOC.Trans in press Pearson, W.R. & Lipman, D.J. (1988) Proc. Natl. Acad. Sci. USA 85,2444-2448 Higgins, D.G. & Stoehr, P.J. (1991)EMBLScan User Guide, EMBL Data Library, Heidelberg, Germany. Cockwell, K.Y. & Giles, I.G. (1989) Comput. Applic. Biol. Sci. 5, 227-232