
Volume 10 Number 1 1982 Nucleic Acids Research

A flexible new computer program for handling DNA sequence data

Manfred Kröger and Anneliese Kröger-Block

Institut für Biologie III, Universität Freiburg, Schänzlestr. 1, D-7800 Freiburg, GFR

Received 14 September 1981

ABSTRACT

A compact new computer program for handling nucleic acid sequence data is presented. It consists of a number of different subsets, which may be used according to a given code system. The program is designed for the determination of restriction enzyme and other recognition sites in correlation with translation patterns, and allows tabulation of codon frequencies and protein molecular weights within specified gene boundaries. The program is especially designed for detection of overlapping genes. The language is FORTRAN and thus the program may be used on small computers; it may also be used without any prior computer experience. Copies are available on request.

INTRODUCTION

An increasing amount of nucleic acid sequence data has become available due to rapidly evolving DNA-sequencing techniques [1,2]. In addition, a rapidly growing number of commercially available restriction enzymes can be used for mapping prior to or during the sequencing work, or for extending into a cloning analysis of genes and signal structures that may be contained in that sequence. Thus a rapid and complete interpretation of the sequence data has become increasingly important as a tool for designing the next experimental step. Consequently, the computer handling should be simple, so that it can be done even by people without any specific knowledge of computer techniques. Another point of increasing importance is the storage, manipulation and editing of the accumulated information in a compact form, so that all information is printed together and the paper output, and hence storage volume, printing costs and printing time, is kept minimal.

We have written a new versatile FORTRAN program which meets
these requirements. For an analysis of the coding properties we wanted to have the capacity for simultaneously printing out all six amino acid reading frames directly underneath the nucleic acid strands. For mapping and cloning purposes we also wanted to have all cleavage sites of known restriction enzymes printed above the actual cutting position. Upon introduction of gene boundaries the amino acid lanes in an edited outprint can be reduced to one or two reading frames (in case of overlapping genes), and the number of restriction endonuclease and other recognition sites may be restricted to any preselected combination. Finally, codon usage can be determined for the different reading frames in general, or within any set of boundaries specified for genes or segments of genes. These data will be printed as a table of codon frequencies and are also converted into a molecular weight determination for the various resulting proteins.

Another computer program with a similar objective has been published earlier by Staden [3,4]. However, Staden's program is less compact and needs a series of call ups, while our program uses a simple code system to provide a variety of different printouts.

The following two chapters are a quite detailed description of the program. They may be skipped by readers without personal computer experience. However, Table 1 and the figures, as a visual description of the program, should be consulted before reading the discussion.

© IRL Press Limited, 1 Falconberg Court, London W1V 5FG, U.K.

GENERAL PROGRAM ARCHITECTURE

To provide maximal user comfort together with minimal program expense, a few communications via the computer keyboard are necessary. During the starting routine the appropriate file is called up. The user then answers two (or three) program questions concerning i) the use of the total file or only part thereof and ii) the Program Execution Code.
If required by the selected code, a statement about iii) gene borders must also be given. If the entire file is to be used, a 0 is typed in instead of a boundary. If only part of the file is to be used, the numbers of the first and the last nucleotide, connected by a comma, have to be specified as boundaries. The file may exceed 50,000 nucleotides, so that large sequence data can be handled.

Table 1 List of different outprints provided by Program Execution Code

Code number   Content of outprint
 1            single strand
 2            double strand
 3            single strand with one line of amino acids
 4            single strand with three lines of amino acids
 5            double strand with one line of amino acids for each strand
 6            double strand with three lines of amino acids for each strand (a)
 7            double strand with complete set of restriction enzyme data
 8            double strand with restriction enzyme data and with three lines of amino acids for each strand (a)
 9            double strand with restriction enzyme data and with selected lines of amino acids (genes) (a)
10            double strand with restriction enzyme data and with selected lines of amino acids (genes) and codon usage table
11            codon usage table and molecular weights for selected genes (proteins)

a) For an example see Fig. 1
b) For an example see Fig. 2

The Program Execution Code provides eleven different outprints according to Table 1. When code numbers 1 through 8 are selected, no more communication via the keyboard is necessary. The call up of code numbers 9 through 11 requires an additional statement about gene borders. Gene borders may be given for any complete or partial genes. Failure to specify gene borders after call up of codes 9 through 11 leads to an error statement. The program includes a table of restriction enzyme recognition sites. This table will not normally be changed, but can be expanded or altered by any individual with some greater experience in FORTRAN programming.
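The keyboard dialogue described above can be illustrated with a short sketch. The Python fragment below is not the original FORTRAN; the function names `parse_range` and `check_run` and the exact error behaviour are assumptions made for illustration. It shows the `0` versus `first,last` convention for the file boundaries and the rule that codes 9 through 11 require gene borders.

```python
# Illustrative sketch only; names and error handling are assumptions,
# not taken from the original FORTRAN program.

CODES_NEEDING_GENE_BORDERS = {9, 10, 11}


def parse_range(answer, file_length):
    """Return (first, last), 1-based and inclusive, from '0' or 'first,last'."""
    answer = answer.strip()
    if answer == "0":  # 0 means: use the entire file
        return 1, file_length
    first_text, last_text = answer.split(",")
    first, last = int(first_text), int(last_text)
    if not 1 <= first <= last <= file_length:
        raise ValueError("boundaries outside the file")
    return first, last


def check_run(code, gene_borders):
    """Validate a Program Execution Code (1-11) against the gene borders given."""
    if code not in range(1, 12):
        raise ValueError("Program Execution Code must be 1 through 11")
    if code in CODES_NEEDING_GENE_BORDERS and not gene_borders:
        raise ValueError("codes 9 through 11 require gene borders")
    return code
```

In this sketch an answer of `0` selects the whole file, an answer such as `254,629` selects a part of it, and a call with code 9 but an empty list of gene borders raises an error, mirroring the error statement mentioned in the text.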
[Figure 1: sample outprints with panels "code 2", "code 7", "code 8" and "code 9"; not reproduced here]

Fig. 1 Examples for outprints provided by code numbers 2, 7, 8, and 9. The sequence shown is part of the IS5 sequence [8]. At position 602 four restriction enzyme cuts fall together. MnlI has priority and is printed in the exact position. Fnu4HI is printed adjacent. BbvI and EcoP15 are suppressed. A "+" character is printed instead, and at the left side the names of the suppressed enzyme cuts are shown. Three enzymes carry a "•" character for optical support. A code 8 outprint without restriction enzyme data is identical with an outprint provided by code 6. In the example shown for code 9 the amino acid lines are exchanged compared to the examples shown directly above, since the lower number of each pair of gene borders decides the printing order. Thus the upper line has to be read from right to left, while the other line has to be read from left to right. This is an illustration of an overlapping gene outprint.
[Figure 2: codon usage table for the selected genes, with START and END positions on top, codon counts and totals per amino acid, and total number of amino acids, molecular weight and numbers of positive and negative charges at the bottom; not reproduced here]

Fig. 2 Example for an outprint of the codon usage table. The data are derived from the IS5 sequence [8]. The individual amino acids are listed in alphabetical order. The third column gives the distribution for the different codons, while the fourth column is the total of each amino acid. Start and end numbers are shown on top of the table. The total number of amino acids and the exact molecular weight of the encoded protein, together with the numbers of positively and negatively charged amino acids, are given at the end of the table. Note that only for technical reasons the table is shown in two parts. An overlapping area is shown on the bottom of the first column.

PROGRAM USAGE AND DESCRIPTION

Files used are single strand nucleic acid sequences in conventional 5'-3' order. All outprints are 120 nucleotides per line. Each line is preceded by a count, starting at number 1 at the beginning of the file or at the selected starting number. A dot above every tenth nucleotide provides additional orientation. No additional numbering other than the machine provided count is used.
Thus sequence files either have to be complete or, in the case of an incomplete sequence, require the input of an appropriate number of hyphens or N's as filling signs. However, every full group of ten hyphens or N's will be suppressed in the outprint and a "•" character will be printed instead, while the correct numbering is still used and printed. Thus, despite these filling signs, a minimal amount of paper outprint is guaranteed. Any length of input record up to 100 is accepted, blanks are suppressed, and hyphens and N's are accepted as unknown nucleotides. Thus corrections within the file can be accomplished easily via a screenboard.

With the help of the code system given in Table 1 a variety of outprints is available. Code numbers 1 through 6 lead to outprints of either single strand or double strand with or without the appropriate translation into protein. Figure 1 gives an example of such an outprint as provided by code 6 (as a part of code 8). Translation starts at position 1, 2, and 3 relative to the starting position. The protein sequence encoded by the upper strand is presented in the upper three lines of amino acids. Accordingly, the lower three lines represent the coding capability of the lower strand. Corresponding to the polarity of the nucleic acid strand, the lower three lines of amino acids should be read from right to left. To distinguish them from each other, the three stop codons are printed with the abbreviations -p- for opal, -a- for amber, and -c- for ochre.

Code number 7 provides a double strand outprint together with restriction enzyme data. The program contains a table of commercially or otherwise readily available, or seemingly interesting, restriction enzyme cleavage sites; to date this includes 57 different restriction enzymes. The name of the appropriate restriction enzyme will be printed above the 5'-terminal nucleotide of the fragment resulting from cleavage at this position (if known).
For restriction endonucleases with unknown cleavage position the first 5'-nucleotide of their recognition sequence has been chosen instead. When two enzymes cut at the same position, the alphabetically second enzyme will be printed adjacent to the first enzyme cut, i.e. above the second nucleotide of the resulting restriction fragment, but with a "/" character pointing to the correct position. Every additional enzyme cut at the same position will be suppressed, but a "+" character will be printed instead and the name of the suppressed enzyme cut will be printed at the beginning of the line. Several enzymes such as MboII, whose cleavage site is not within or immediately adjacent to the recognition sequence, have been selected for preference in printing at the correct position. Enzymes with hexanucleotide palindrome recognition and, therefore, less frequent fragmentations carry an optical support for easier recognition ('major restriction sites'). Figure 1 shows an example for an outprint provided by code 7.

Code number 8 provides a standard double strand outprint together with restriction enzyme data above and six lines of amino acids below the DNA sequence. This code provides maximal information and is a combination of codes 6 and 7. Figure 1 shows an example for an outprint provided by this code.

Code number 9 will lead to the outprint of a double strand together with restriction enzyme data but only with selected lines of amino acid sequences (genes). This code requires the statement of gene borders. The borders may enclose entire or partial genes. Because of the space available on the print, gene borders for up to 17 genes may be defined in one set, each in two groups of up to five digits linked by a comma. The order of lower versus higher or higher versus lower sequence positions defines the selection of an upper or lower strand coding frame, i.e. a rightward or leftward oriented gene, respectively.
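The gene border convention can be illustrated with a small sketch. The Python below is not the original FORTRAN; the names `gene_strand`, `coding_sequence` and `translate` are hypothetical, the standard genetic code is assumed, and the first number of a border pair is taken to be the start of the gene. It shows how the order of the two border numbers selects the strand, and how the three stop codons receive the labels -p- (opal, TGA), -a- (amber, TAG) and -c- (ochre, TAA) used in the outprints.

```python
# Illustrative sketch; all names are hypothetical and the standard
# genetic code is assumed.
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
ONE_LETTER = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
STOP_LABEL = {"TGA": "-p-", "TAG": "-a-", "TAA": "-c-"}  # opal, amber, ochre
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gene_strand(start, end):
    """Return 'upper' (rightward gene) if start < end, else 'lower' (leftward)."""
    return "upper" if start < end else "lower"


def coding_sequence(sequence, start, end):
    """Return the coding strand of a gene, 5'-3', for 1-based inclusive borders."""
    low, high = min(start, end), max(start, end)
    segment = sequence[low - 1:high].upper()
    if gene_strand(start, end) == "lower":
        segment = segment.translate(COMPLEMENT)[::-1]  # reverse complement
    return segment


def translate(coding):
    """Translate complete codons into three-letter codes and stop labels."""
    labels = []
    for i in range(0, len(coding) - 2, 3):
        codon = coding[i:i + 3]
        if codon in STOP_LABEL:
            labels.append(STOP_LABEL[codon])
        else:
            # Codons containing unknown nucleotides are shown as "???".
            labels.append(THREE_LETTER.get(ONE_LETTER.get(codon, ""), "???"))
    return labels
```

With this sketch a border pair such as 1,9 is read from the upper strand as given, while a pair such as 16,8 selects the lower strand, so the segment is reverse-complemented before translation. In the outprint the amino acids of such a leftward gene would be read from right to left.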
It is also possible to print out two or more overlapping genes. The only necessary prerequisite is that the gene borders have to be entered according to their sequence position, in increasing order (left to right), with the lower number of each border pair determining the input order. For an example of a printout provided by code 9 see Figure 1.

Code number 10 extends the analysis of code 9, and in addition provides for every selected gene information about its codon usage, amino acid composition, total number of amino acids, and the exact molecular weight of the resulting protein together with its number of positive and negative charges. This additional information, for a total of up to 17 genes in one run, is printed as a table below the sequence chart. For an example of such a table see Figure 2.

Code 11 provides the codon usage table as described for code 10 without printing any sequence information.

DISCUSSION

The program described in this paper provides compact computer outprints for the analysis of a DNA sequence regarding its translation and restriction enzyme patterns. After a brief introduction by a computer expert, and with maintenance provided by such an individual, the eleven standard outprints described above may be obtained by anybody without any computer experience. The flexibility provided by the program architecture allows for easy additional changes. These may concern the number of nucleotides per line in the printout, different selections of restriction endonucleases and other recognition sites (such as the E. coli promoter consensus sequence [5]), or an analysis of hybrid fragment combinations from different data files in a pre-evaluation of cloning experiments. In the latter application the outprint provided by code 8 will yield a direct readout of the expected fusion protein(s), if different coding areas should be fused together in the cloning reaction.
This approach has been successfully applied in a study of insertion element IS5 coded proteins [6,7], which have been analysed as fusion proteins of increased size in addition to their direct analysis.

Another objective of this program is the detection of overlapping genes within an outprint of otherwise standard information. Overlapping genes of opposite polarity have been observed to occur in the bacterial insertion element IS5 [8]. A theoretical approach for and some hypotheses on coding capabilities of complementary DNA strands were recently published by Cascino et al. [9]. Though they report results of a computer analysis, no program details are given in their paper.

All runs were performed on a Univac 1108 machine, but the program may be used on smaller computers as well. It will be expanded in the near future by a calculation of the size of DNA fragments generated by restriction enzyme cleavages, both by single and multiple enzyme digestions.

ACKNOWLEDGEMENTS

All programming was done on the Univac 1108 computer of the Universitäts-Rechenzentrum of the Albert-Ludwigs-Universität Freiburg. We would like to thank Dr. B. Gottwald for constant help and assistance, and especially Dr. G. Hobom for the biological concepts and continuing discussions.

REFERENCES

1. Maxam, A.M. and Gilbert, W. (1980) Methods Enzymology 65, 499-560.
2. Sanger, F. and Coulson, A.R. (1978) FEBS Letters 87, 107-110.
3. Staden, R. (1977) Nucleic Acids Research 4, 4037-4051.
4. Staden, R. (1978) Nucleic Acids Research 5, 1013-1015.
5. Rosenberg, M. and Court, D. (1979) Ann. Rev. Genet. 13, 319-353.
6. Rak, B., Lusky, M. and Hable, M. (1981) Nature, submitted.
7. Hobom, G., Kröger, M., Rak, B. and Lusky, M. (1981) in Structure and DNA-Protein Interactions of Replication Origins, ICN-UCLA Symposia on Molecular and Cellular Biology, XXI (Dan S. Ray and C. Fred Fox, eds.) Academic Press, New York, in press.
8. Kröger, M. and Hobom, G. (1981) Nature, submitted.
9. Cascino, A., Cipollaro, M., Guerrini, A.M., Mastrocinque, G., Spena, A. and Scarlato, V. (1981) Nucl. Acids Res. 9, 1499-1588.
