Volume 10 Number 1 1982
Nucleic Acids Research
A flexible new computer program for handling DNA sequence data
Manfred Kröger and Anneliese Kröger-Block
Institut für Biologie III, Universität Freiburg, Schänzlestr. 1, D-7800 Freiburg, GFR
Received 14 September 1981
ABSTRACT
A compact new computer program for handling nucleic acid sequence data is presented. It consists of a number of different
subsets, which may be used according to a given code system. The
program is designed for the determination of restriction enzyme
and other recognition sites in correlation with translation patterns, and allows tabulation of codon frequencies and protein
molecular weights within specified gene boundaries. The program
is especially designed for detection of overlapping genes. The
language is FORTRAN and thus the program may be used on small
computers; it may also be used without any prior computer experience. Copies are available on request.
INTRODUCTION
An increasing amount of nucleic acid sequence data has become
available due to rapidly evolving DNA-sequencing techniques (1,2).
In addition, a rapidly growing number of commercially available
restriction enzymes can be used for mapping prior to or during the
sequencing work, or for extending it into a cloning analysis of genes and signal structures that may be contained in that sequence.
Thus a rapid and complete interpretation of the sequence data has
become increasingly important as a tool for designing the next
experimental step. Consequently, the computer handling should
be simple enough that it can be done even by people without any
specific knowledge of computer techniques. Another point of increasing importance is the storage, manipulation and editing of the
accumulated information in a compact form, so that all
information is printed together and the paper output is kept as
small as possible; storage volume, printing costs and printing
time should thus be minimal.
© IRL Press Limited, 1 Falconberg Court, London W1V 5FG, U.K.
We have written a new versatile FORTRAN program which meets
these requirements. For an analysis of the coding properties we
wanted to have the capacity for simultaneously printing out all
six amino acid reading frames directly underneath the nucleic
acid strands. For mapping and cloning purposes we also wanted
to have all cleavage sites of known restriction enzymes printed
above the actual cutting position. Upon introduction of gene
boundaries the amino acid lanes in an edited printout can
be reduced to one or two reading frames (in the case of overlapping genes), and the number of restriction endonuclease and other
recognition sites may be restricted to any preselected combination. Finally, codon usage can be determined for the different
reading frames in general, or within any set of boundaries specified for genes or segments of genes. These data are printed as a table of codon frequencies and are also converted into
a molecular weight determination for the various resulting proteins.
Another computer program with a similar objective has been
published earlier by Staden (3,4). However, Staden's program is
less compact and needs a series of call ups, while our program
uses a simple code system to provide a variety of different
printouts.
The following two sections are a fairly detailed description
of the program. They may be skipped by readers without personal
computer experience. However, Table 1 and the Figures, as a visual description of the program, should be seen before reading
the discussion.
GENERAL PROGRAM ARCHITECTURE
To provide maximal user comfort at minimal program expense, a few communications via the computer keyboard are
necessary. During the starting routine the appropriate file is
called up. The user then answers two or three program
questions concerning i) the use of the total file or only part
thereof and ii) the Program Execution Code. If required by the
selected code, iii) a statement about gene borders must also
be given.
If the entire file is to be used, a 0 is typed in instead of
a boundary. If only part of the file is to be used, the numbers
of the first and the last nucleotide, connected by a comma, have to be specified as boundaries. The file may exceed
50,000 nucleotides, so that large sequence data can be
handled.
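The boundary question described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' FORTRAN code: the function name `select_range` and the error handling are hypothetical, and the answer is assumed to arrive as a single line of text.

```python
def select_range(sequence: str, answer: str) -> tuple[int, str]:
    """Return (number of the first nucleotide, selected part of the file).

    `answer` is "0" for the entire file, or "first,last" with 1-based,
    inclusive nucleotide numbers (hypothetical reading of the keyboard dialog).
    """
    answer = answer.strip()
    if answer == "0":
        return 1, sequence
    first_text, last_text = answer.split(",")
    first, last = int(first_text), int(last_text)
    if not 1 <= first <= last <= len(sequence):
        raise ValueError(f"boundaries {first},{last} lie outside 1..{len(sequence)}")
    # Python slices are 0-based and end-exclusive, hence first - 1.
    return first, sequence[first - 1:last]
```

The returned start number lets later steps keep the original numbering of the file when only part of it is printed.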
Table 1
List of the different printouts provided by the Program Execution Code

Code
number   Content of printout
   1     single strand
   2     double strand
   3     single strand with one line of amino acids
   4     single strand with three lines of amino acids
   5     double strand with one line of amino acids for each strand
   6     double strand with three lines of amino acids for each strand (a)
   7     double strand with complete set of restriction enzyme data
   8     double strand with restriction enzyme data and with three lines
         of amino acids for each strand (a)
   9     double strand with restriction enzyme data and with selected
         lines of amino acids (genes) (a)
  10     double strand with restriction enzyme data and with selected
         lines of amino acids (genes) and codon usage table
  11     codon usage table and molecular weights for selected genes
         (proteins)

a) For an example see Fig. 1
b) For an example see Fig. 2
The Program Execution Code provides eleven different printouts according to Table 1. When code numbers 1 through 8 are
selected, no further communication via the keyboard is necessary.
Calling up code numbers 9 through 11 requires an additional
statement about gene borders. Gene borders may be given for any
complete or partial genes. If no gene borders are specified after calling up codes 9 through 11, an error message results.
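One way to picture the code system is as a small lookup table. The Python sketch below restates Table 1 as data and checks the rule that codes 9 through 11 need gene borders; the names `Printout`, `CODES`, `needs_gene_borders` and `check_request` are hypothetical and do not come from the original program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Printout:
    strands: int            # 0 = no sequence printed, 1 = single, 2 = double
    frames_per_strand: int  # fixed lines of amino acids per strand (0, 1 or 3)
    enzymes: bool           # restriction enzyme data printed above the sequence
    selected_genes: bool    # amino acid lines chosen via gene borders
    codon_table: bool       # codon usage table and molecular weights


# Transcription of Table 1 (code number -> content of printout).
CODES = {
    1: Printout(1, 0, False, False, False),
    2: Printout(2, 0, False, False, False),
    3: Printout(1, 1, False, False, False),
    4: Printout(1, 3, False, False, False),
    5: Printout(2, 1, False, False, False),
    6: Printout(2, 3, False, False, False),
    7: Printout(2, 0, True, False, False),
    8: Printout(2, 3, True, False, False),
    9: Printout(2, 0, True, True, False),
    10: Printout(2, 0, True, True, True),
    11: Printout(0, 0, False, True, True),
}


def needs_gene_borders(code: int) -> bool:
    """Codes 9 through 11 work on selected genes and so need gene borders."""
    return CODES[code].selected_genes


def check_request(code: int, borders: list[str]) -> None:
    """Raise an error when a gene-based code is called without gene borders."""
    if needs_gene_borders(code) and not borders:
        raise ValueError(f"code {code} requires a statement about gene borders")
```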
The program includes a table of restriction enzyme recognition sites. This table will not normally be changed, but it can be
expanded or altered by anyone with some experience in FORTRAN programming.
[Fig. 1: printout panels for codes 2, 7, 8 and 9. The panel contents are not recoverable from the extracted text.]
Fig. 1 Examples of printouts provided by code numbers 2, 7, 8,
and 9. The sequence shown is part of the IS5 sequence (8). At position
602 four restriction enzyme cuts fall together. MnlI has priority and is printed in the exact position. Fnu4HI is printed adjacent. BbvI and EcoP15 are suppressed. A "+" character is printed instead, and at the left side the names of the suppressed
enzyme cuts are shown. Three enzymes carry a "•" character for
optical support. A code 8 printout without restriction enzyme data is identical to a printout provided by code 6. In the
example shown for code 9 the amino acid lines are exchanged
compared to the examples shown directly above, since the lowest
number of the gene borders decides the printing order. Thus
the upper line has to be read from right to left, while the
other line has to be read from left to right. This is an illustration of an overlapping gene printout.
[Fig. 2: codon usage table printout. The table body is not recoverable from the extracted text.]
Fig. 2 Example of a printout of the codon usage table. The
data are derived from the IS5 sequence (8). The order is alphabetical for the individual amino acids. The third column
gives the distribution for the different codons, while the
fourth column is the total of each amino acid. Start and end
numbers are shown on top of the table. The total number of amino
acids and the exact molecular weight of the encoded protein, together
with the numbers of positively and negatively charged amino
acids, are given at the end of the table. Note that the table is shown in two parts
for technical reasons only. An overlapping area is shown at the bottom of the first column.
PROGRAM USAGE AND DESCRIPTION
Files used are single strand nucleic acid sequences in conventional 5'-3' order. All printouts have 120 nucleotides per
line. Each line is preceded by a count, starting at number 1 at the
beginning of the file or at the selected starting number. A dot
above every tenth nucleotide provides additional orientation.
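The line layout just described can be sketched in Python as follows. This is a rough illustration under assumptions: the function name `layout`, the width of the count margin, and the choice to put the dot above nucleotides whose number is a multiple of ten are not taken from the original program.

```python
def layout(sequence: str, start: int = 1, width: int = 120, margin: int = 7) -> list[str]:
    """Split a sequence into printed lines, each preceded by its count.

    Every block yields two text lines: a ruler line carrying a dot above
    every tenth nucleotide, and the sequence line with the number of its
    first nucleotide in the left margin.
    """
    lines = []
    for offset in range(0, len(sequence), width):
        chunk = sequence[offset:offset + width]
        first = start + offset  # number of the first nucleotide in this line
        dots = "".join("." if (first + i) % 10 == 0 else " " for i in range(len(chunk)))
        lines.append(" " * margin + dots.rstrip())
        lines.append(f"{first:>{margin - 1}} " + chunk)
    return lines
```

Passing the selected starting number as `start` keeps the count consistent when only part of a file is printed.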
No numbering other than the machine-provided
count is used. Thus sequence files either have to be complete
or, in the case of an incomplete sequence, require the input of an
appropriate number of hyphens or N's as filling signs. However,
every full group of ten hyphens or N's is suppressed in
the printout and a "•" character is printed instead, while
the correct numbering is still used and printed. Thus, despite
these filling signs, a minimal amount of paper printout is
guaranteed. Input records of any length up to 100 are accepted,
blanks are suppressed, and hyphens and N's are accepted as unknown nucleotides. Thus corrections within the file can be accomplished easily at a screen terminal.
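The input rules above might be modelled as in the following Python sketch. It rests on several assumptions that the text does not settle: hyphens are folded into N internally, lowercase input is accepted, and a "full group of ten" is taken to mean ten consecutive unknown nucleotides counted from the start of a run. The names `read_records` and `compress_unknown` are hypothetical.

```python
def read_records(records: list[str]) -> str:
    """Join input records, dropping blanks and treating '-' and 'N' as unknown."""
    parts = []
    for record in records:
        if len(record) > 100:
            raise ValueError("input record longer than 100 characters")
        parts.append(record.replace(" ", "").upper().replace("-", "N"))
    sequence = "".join(parts)
    unexpected = set(sequence) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return sequence


def compress_unknown(sequence: str, group: int = 10, mark: str = "•") -> tuple[str, list[int]]:
    """Replace each full group of ten unknowns by one mark.

    Returns the printed text and, for every printed character, the number of
    the nucleotide it stands for, so that the numbering stays correct.
    """
    printed, numbers = [], []
    i = 0
    while i < len(sequence):
        numbers.append(i + 1)
        if sequence[i:i + group] == "N" * group:
            printed.append(mark)  # one mark stands for ten suppressed positions
            i += group
        else:
            printed.append(sequence[i])
            i += 1
    return "".join(printed), numbers
```

For example, a gap of 22 unknown positions is printed as two marks followed by two N's, while the nucleotide after the gap keeps its true number.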
With the help of the code system given in Table 1 a variety
of printouts is available. Code numbers 1 through 6 lead to printouts of either a single strand or a double strand, with or without
the appropriate translation into protein. Figure 1 gives an
example of such a printout as provided by code 6 (as a part of
code 8). Translation starts at positions 1, 2, and 3 relative to
the starting position. The protein sequence encoded by the upper strand is presented in the upper three lines of amino acids.
Accordingly, the lower three lines represent the coding capability of the lower strand. Corresponding to the polarity of the
nucleic acid strand, the lower three lines of amino acids should
be read from right to left.
To distinguish them from each other, the three stop codons are
printed with the abbreviations -o- for opal, -a- for amber, and
-c- for ochre.
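The six reading frames and the stop codon abbreviations can be illustrated with a short Python sketch. It assumes the standard genetic code and the usual codon assignments TGA = opal, TAG = amber, TAA = ochre; the function names and the "???" placeholder for codons containing unknown nucleotides are hypothetical. Lower strand frames are returned in their own 5'-3' order, so a printout aligned under the upper strand would show them reversed.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
THREE = {"A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
         "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
         "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
         "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val"}
STOPS = {"TGA": "-o-", "TAG": "-a-", "TAA": "-c-"}  # opal, amber, ochre

CODON_TABLE = {}
for i, a in enumerate(BASES):
    for j, b in enumerate(BASES):
        for k, c in enumerate(BASES):
            codon = a + b + c
            CODON_TABLE[codon] = STOPS.get(codon) or THREE[AMINO[16 * i + 4 * j + k]]


def reverse_complement(seq: str) -> str:
    """Return the lower strand in 5'-3' order (N stays N)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str, frame: int) -> list[str]:
    """Translate from a 0-based frame offset; unknown codons become '???'."""
    return [CODON_TABLE.get(seq[i:i + 3], "???") for i in range(frame, len(seq) - 2, 3)]


def six_frames(seq: str) -> dict[str, list[list[str]]]:
    """Three frames for the upper strand and three for the lower strand."""
    lower = reverse_complement(seq)
    return {"upper": [translate(seq, f) for f in range(3)],
            "lower": [translate(lower, f) for f in range(3)]}
```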
Code number 7 provides a double strand printout together with
restriction enzyme data. The program contains a table of cleavage sites for restriction enzymes that are commercially or otherwise readily available, or that seem interesting;
to date this includes 57 different restriction enzymes. The name of the appropriate restriction enzyme is printed above the 5'-terminal nucleotide of
the fragment resulting from cleavage at this position (if
known). For restriction endonucleases with unknown cleavage position, the first 5'-nucleotide of their recognition sequence
has been chosen instead. When two enzymes cut at the same position, the alphabetically second enzyme is printed adjacent
to the first enzyme cut, i.e. above the second nucleotide of the
resulting restriction fragment, but with a "/" character pointing to the correct position. Every additional enzyme cut at
the same position is suppressed; a "+" character is
printed instead and the name of the suppressed enzyme cut
is printed at the beginning of the line. Several enzymes
such as MboII, whose cleavage site is not within or
immediately adjacent to the recognition sequence, have been selected for preference in printing at
the correct position. Enzymes with
hexanucleotide palindrome recognition and, therefore, less frequent fragmentation carry an optical support for easier recognition ('major restriction sites'). Figure 1 shows an example
of a printout provided by code 7.
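The label placement rule for coinciding cuts can be sketched as follows. This Python illustration makes several simplifications and is not the program's actual table or logic: only five palindromic enzymes with well-known cleavage positions are listed (the original table held 57), the preference rule for enzymes such as MboII is left out, and a clash between an adjacent "/" label and a neighbouring cut is not handled. The names `ENZYMES`, `find_cuts` and `place_labels` are hypothetical.

```python
# Enzyme name -> (recognition sequence, cut offset from its first nucleotide).
# All sites listed here are palindromic, so searching one strand is enough.
ENZYMES = {
    "BamHI": ("GGATCC", 1),   # G^GATCC
    "EcoRI": ("GAATTC", 1),   # G^AATTC
    "HaeIII": ("GGCC", 2),    # GG^CC
    "MboI": ("GATC", 0),      # ^GATC
    "Sau3AI": ("GATC", 0),    # ^GATC
}


def find_cuts(seq: str, enzymes: dict = ENZYMES) -> dict[int, list[str]]:
    """Map the number of the 5'-terminal nucleotide of each new fragment
    (1-based) to the alphabetically sorted names of the enzymes cutting there."""
    cuts: dict[int, list[str]] = {}
    for name, (site, offset) in enzymes.items():
        start = seq.find(site)
        while start != -1:
            cuts.setdefault(start + offset + 1, []).append(name)
            start = seq.find(site, start + 1)
    return {pos: sorted(names) for pos, names in sorted(cuts.items())}


def place_labels(cuts: dict[int, list[str]]):
    """Apply the collision rule: first name at the exact position, second name
    one position to the right with a leading '/', the rest suppressed."""
    labels, plus_marks, suppressed = {}, set(), []
    for pos, names in cuts.items():
        labels[pos] = names[0]
        if len(names) > 1:
            labels[pos + 1] = "/" + names[1]
        if len(names) > 2:
            plus_marks.add(pos)           # a "+" is printed for the hidden cuts
            suppressed.extend(names[2:])  # names go to the start of the line
    return labels, plus_marks, suppressed
```

At a BamHI site, for instance, BamHI, MboI and Sau3AI all cut at the same position, so BamHI is printed in place, MboI adjacent with a "/", and Sau3AI is suppressed.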
Code number 8 provides a standard double strand printout together with restriction enzyme data above and six lines of amino
acids below the DNA sequence. This code provides maximal information and is a combination of codes 6 and 7. Figure 1 shows
an example of a printout provided by this code.
Code number 9 leads to the printout of a double strand
together with restriction enzyme data, but only with selected
lines of amino acid sequences (genes). This code requires the
statement of gene borders. The borders may enclose entire or
partial genes. Because of the space available on the printout, gene
borders for up to 17 genes may be defined in one set, each as
two groups of up to five digits linked by a comma. The order
of lower versus higher or higher versus lower sequence positions
defines the selection of an upper or lower strand coding
frame, i.e. a rightward or leftward oriented gene, respectively. It is also possible to print out two or more overlapping
genes. The only prerequisite is that the gene borders
have to be entered according to their sequence position, in increasing order (left to right), with the lower number of each
border pair deciding the input order. For an example of
a printout provided by code 9 see Figure 1.
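The gene border rules can be summarised in a small validation routine. The Python sketch below is an assumption-laden illustration: the text does not say how the individual pairs are separated on input, so each pair is taken as its own string, and the names `Gene` and `parse_borders` as well as the example positions are hypothetical.

```python
from typing import NamedTuple

MAX_GENES = 17  # limit given in the text for one set of gene borders


class Gene(NamedTuple):
    start: int   # first border as entered (where translation begins)
    end: int     # second border as entered
    strand: str  # "upper" = rightward gene, "lower" = leftward gene


def parse_borders(pairs: list[str]) -> list[Gene]:
    """Validate gene border pairs such as "100,400" and derive the strand."""
    if not pairs:
        raise ValueError("gene borders are required")
    if len(pairs) > MAX_GENES:
        raise ValueError(f"at most {MAX_GENES} genes may be defined in one set")
    genes = []
    for pair in pairs:
        a_text, b_text = (part.strip() for part in pair.split(","))
        for text in (a_text, b_text):
            if not text.isdigit() or len(text) > 5:
                raise ValueError(f"border {text!r} is not a number of up to five digits")
        a, b = int(a_text), int(b_text)
        if a == b:
            raise ValueError("the two borders of a gene must differ")
        # Lower-to-higher selects the upper strand, higher-to-lower the lower strand.
        genes.append(Gene(a, b, "upper" if a < b else "lower"))
    lows = [min(g.start, g.end) for g in genes]
    if lows != sorted(lows):
        raise ValueError("enter gene borders in increasing order of their lower number")
    return genes
```

Two overlapping genes of opposite polarity would, for example, be entered as `["100,400", "350,200"]`.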
Code number 10 extends the analysis of code 9 and, in addition, provides for every selected gene information about its codon usage, amino acid composition, total number of amino acids,
and the exact molecular weight of the resulting protein, together with its number of positive and negative charges. This
additional information, for a total of up to 17 genes in one run, is
printed as a table below the sequence chart. For an example of
such a table see Figure 2.
Code 11 provides the codon usage table as described for code
10 without printing any sequence information.
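A codon usage table with the summary figures described above could be computed roughly as in the following Python sketch. It is not the authors' method: the standard genetic code and standard average residue masses are assumed, the molecular weight is taken as the sum of residue masses plus one water, and Lys and Arg are counted as positive, Asp and Glu as negative (whether the original program also counted His is not stated). All function names are hypothetical.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

# Average residue masses in daltons (standard values, not from the paper).
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.015


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def codon_usage(seq: str, start: int, end: int) -> Counter:
    """Count codons between two 1-based gene borders.

    start > end selects a leftward gene read from the lower strand.
    """
    if start <= end:
        coding = seq[start - 1:end]
    else:
        coding = reverse_complement(seq[end - 1:start])
    return Counter(coding[i:i + 3] for i in range(0, len(coding) - 2, 3))


def summarize(usage: Counter) -> dict:
    """Total amino acids, molecular weight and charge counts for one gene."""
    residues = Counter()
    for codon, count in usage.items():
        aa = CODON_TO_AA.get(codon)
        if aa and aa != "*":  # skip stop codons and codons with unknown bases
            residues[aa] += count
    total = sum(residues.values())
    weight = sum(RESIDUE_MASS[aa] * n for aa, n in residues.items())
    return {
        "total": total,
        "molecular_weight": weight + WATER if total else 0.0,
        "positive": residues["K"] + residues["R"],
        "negative": residues["D"] + residues["E"],
    }
```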
DISCUSSION
The program described in this paper provides compact computer printouts for the analysis of a DNA sequence with regard to its
translation and restriction enzyme patterns. After a brief introduction by a computer expert, and with maintenance provided by
such an individual, the eleven standard printouts described
above may be obtained by anybody, without any computer experience. The flexibility provided by the program architecture allows for easy additional changes. These may concern the number of
nucleotides per line in the printout, different selections of
restriction endonucleases and other recognition sites (such as
the E. coli promoter consensus sequence (5)), or an analysis of hybrid
fragment combinations from different data files in a pre-evaluation of cloning experiments.
In the latter application the printout provided by code 8
yields a direct readout of the expected fusion protein(s)
if different coding areas are fused together in the cloning reaction. This approach has been successfully applied in
a study of insertion element IS5-coded proteins (6,7), which have
been analysed as fusion proteins of increased size in addition
to their direct analysis.
Another objective of this program is the detection of overlapping genes within a printout of otherwise standard information. Overlapping genes of opposite polarity have been observed
to occur in bacterial insertion element IS5 (8). A theoretical
approach to, and some hypotheses on, the coding capabilities of complementary DNA strands were recently published by Cascino et
al. (9). Though they report results of a computer analysis, no
program details are given in their paper.
All runs were performed on a Univac 1108 machine, but the
program may be used on smaller computers as well. It will be
expanded in the near future by a calculation of the size of
DNA fragments generated by restriction enzyme cleavage, in
both single and multiple enzyme digestions.
ACKNOWLEDGEMENTS
All programming was done on the Univac 1108 computer of the
Universitäts-Rechenzentrum of the Albert-Ludwigs-Universität
Freiburg. We would like to thank Dr. B. Gottwald for constant
help and assistance, and especially Dr. G. Hobom for the biological concepts and continuing discussions.
REFERENCES
1. Maxam, A.M. and Gilbert, W. (1980) Methods Enzymology 65,
499-560.
2. Sanger, F. and Coulson, A.R. (1978) FEBS Letters 87, 107-110.
3. Staden, R. (1977) Nucleic Acids Research 4, 4037-4051.
4. Staden, R. (1978) Nucleic Acids Research 5, 1013-1015.
5. Rosenberg, M. and Court, D. (1979) Ann. Rev. Genet. 13,
319-353.
6. Rak, B., Lusky, M. and Hable, M. (1981) Nature, submitted.
7. Hobom, G., Kröger, M., Rak, B. and Lusky, M. (1981) in
Structure and DNA-Protein Interactions of Replication Origins, ICN-UCLA Symposia on Molecular and Cellular Biology,
XXI (Dan S. Ray and C. Fred Fox, eds.) Academic Press, New
York, in press.
8. Kröger, M. and Hobom, G. (1981) Nature, submitted.
9. Cascino, A., Cipollaro, M., Guerrini, A.M., Mastrocinque, G.,
Spena, A. and Scarlato, V. (1981) Nucl. Acids Res. 9, 1499-1588.