* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download A General Method Applicable to the Search for Similarities in the
Endogenous retrovirus wikipedia , lookup
Magnesium transporter wikipedia , lookup
Interactome wikipedia , lookup
Signal transduction wikipedia , lookup
Biochemical cascade wikipedia , lookup
Peptide synthesis wikipedia , lookup
Metalloprotein wikipedia , lookup
Homology modeling wikipedia , lookup
Paracrine signalling wikipedia , lookup
Western blot wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Point mutation wikipedia , lookup
Amino acid synthesis wikipedia , lookup
Biosynthesis wikipedia , lookup
Proteolysis wikipedia , lookup
J. Mol. Bwl. (1970)48, 443-453 A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins SAGLB. NEEDLE>=ASD CHRISTUS D. WUKSCH Department of Biochemistry, Northwestern University, and hlwlear Medicine Service, V. A . Research Hospital Chicago, Ill. 60611, U.S.A. (Received 21 July 2969) A computer adaptable method forfinding similarities in the amino acid sequences of two proteins has been developed.From these findings it is possible to determine whether significant homology exists between the proteins. This information is used to trace their possible evolutionary development. The maximum match is a number dependent' upon the similarity of the sequences. One of its definitionsis the largest number of amino acids of one protein that can be matched with those of a second protein allowing for all possible interruptions in either of the sequences. While the interruptions give rise to a very large number of comparisons, the method efficiently excludes from consi- deration those comparisonsthat cannot contribute to the maximum match. Comparisons are made from the smallest unit of significance, a pair of amino acids, one from each protein. All possible pairs are represented by a two-dimensional array, and all possible comparisons are representod by pathways through the array. For this maximum match only cerhain ofthe possible pathways must be evaluated. A numerical value, one in this case. is assigned to every cell in the array representing like aminoa c i d s . The maximum match is the largest number that would result from summing the cell values of every pathway. 1. Introduction The amino acid sequences of a number of proteins have been compared to determine whether the relationships existing between them could have occurred by chance. Generally, these sequences are from proteins haring closely related functions and are so similar that simple visual comparisons can reveal sequence coincidence. Because the method of visual comparison is tedious and because the determination of the significance of a given result usually is left t o intuitive rationalization, computerbased statistical approaches have been proposed (Fitch, 1966; Xeedleman 6 Blair, 1969). Direct comparison of two sequences, based on the presence in both of corresponding amino acids in a n identical array, is insuEcient to establish the full genetic relationships between the two proteins. Allowance forgaps(Braunitzer, 1965) greatly mnltiplies the number of comparisons t.hat can be made but int.roduces unnecessary and partial comparisons. 2. A General Method for Sequence Comparison The smallest unit of comparison is a pair of amino acids, one from each protein. The maximum match can be definedas the largest numberof amino acids of one protein that can be matched with those of another protein while allowing for all possible deletions. 4.43 444 S. 15. S E E l l L E . \ I A S A S D C. D. \VLSSCH The maximum match can be determined by representing in a two-dimensional a m p , all possible pair combinations that can be constructed from the amino acid sequenw of the proteins, A and B, being compared. If the amino acids are numbered from the X-terminal end, A j is t l ~ e j t hamino acid of protein A and Bi is the ith amino acid of protein B. The A j represent the columns and the Bi therows of the two-dimensional array, BUT. Then the cell, MATij, rcprescnts a pair combination that contains Aj and Bi. Every possible comparison can now be represented by pathways through the array. An i or j can occur only once in a pathway because a particular amino acid cannot occupy more than one positionat one time.Furthermore,if JLiTmnis part ofapathway including MATij, the only permissible relationships of their indices are m > i, n > j or m < i, n <j. Any other relationships represent permutations of one or both amino acid sequences which cannot beallowedsince this destroys the significance of a sequence. Then any pathway can be represented by X A T Q .~. .JLaTyz, where a 3 1, b 3 1, the i and j of all subsequent cells of $UT are larger than therunning indices of the previous cell and y K,z 111, the total number of amino acids comprising the sequences of proteins A and B, respectively. A pathway is signified by a line connecting cells of the array. Complete diagonals of the arraycontain nogaps. When MATij and MATmn are part of a pathway, i - na # j - n is a sufficient, but not necessary condition for a gap t o occur. A necessary pathway through &UT is d e h e d as one which begins a t a cell in the first column or the first row. Both iand j must increase in value; either i or j must increase by only one but the other index mky increase by one or more. This leads to the nest cell in a JUT pathway. This procedure is repeated until either i or j , or both, equal their limiting ralues, R and d l , respectively. Every partial or unnecessary pat.hway will be contained in at least one. necessary pathway. In the simplest method, MATij is a s s i g n e d the value, one, if -4j is the same kind of amino acid as Bi; if they are different amino acids, MATij is assigned the value, zero. The sophistication of the comparison is increased if, instead of zero or one, each cell value is made a function of the composition of the proteins, the genetic code triplets representing the amino acids, the neighboring cellsin the array, or any theory concerned with the si,gnificance of a pair of amino acids. A penalty factor, a number subtracted for every gap made, may be assessed as a barrier to allowing the gap. The penalty factor could be a function of the size and/or direction of the gap. No gap would be allowed in the operation unless the benefit from alloming that gap mould exceed the barrier. The maximum-match pathway then, is that pat.hway for which the sum of the assigned cell ralues (less any penalty factors) is largest. MAT can be broken u p into subsections operated upon independently. The method also can be expanded to allow simultaneous comparison of several proteins using the amino acid sequences of n proteins to generate an n-dimensional array whose cells represent all possible combinations of n amino acids, one from each protein. The maximum-match pathway can be obtained by beginning at the terminals of the sequences (i = y,j = z) and proceeding toward the origins, fist by adding to the value of each cell possessing indices i = y - 1 and/or j = z - 1, the maximum value from among all the cells whichlie on a pathwayto it. Theprocess is repeated for indices i = y - 2 and/orj = z - 2. This increment in the indices is continued until all cells in thematrix haye been operated upon. Each cell in this outerrow or column will contain the maximum number of matches that can be obtained by originating < < r: S I M I L A R I T I E S I N AMINO ACID SEQUEXCE 445 any pathway a t that cell and the largest nurnbcr in that row or column is equal to the maximum match; thc maximum-match pathway in any row or column must begin a t this number. The opcration of successive summations of ccll values is illustrated in Figures 1 and 2. A B C N J R O C L C R P M 2 0 0 A 1 J 1 C J 1 N t R 4 3 c 3 3 4 3 3 3 3- 4 ~ 3 3 3 3 3 3 3 C 1 1 I 2 2 R z ~ 1 P 3 I 2 2 0 2 : 0 ~ 1 0 0 3 - 3 3 -3 2 2 2 1 2 1 2 ( 3 1 0 0 3 2 t o o 3 1 1 1 11 1 i 1 1 1 1 0 G T 0 0 0 O O i 2 0 0 0 i 0 1 0 1 i 0 ~ ~ FIG.1. The Inaxhnunl-match operation for necessary pnthways. The number contained in each cell of the m y is the largest number of identical pairs that can be found if that cell is the origin for a pathway which proceeds with increases in running indices. Identicat pairsofamino acids were given the valueof one. Blank cells whichrepresent non-identical pairs have the value. zero.T h e operation of successive Bummationswas begun at the last row of the array and proceeded row-by-row towards t,he6rst row. Tho opcration has been partially completed in the R row.The enclosscl cellin this row is the site of the cell operation which consists of a search along the subrow and subcolumn indicated by borders for thc largcst value, 4 in subrow C. This value is added to tho cell from which tho search began. A i B C N J R 7 O C 4 3 C 6 4 4 J 6 4 3 N 5 4 3 R 4 4 4 4 4 c 3 3 4 3 3 L C R P M 3 FIG. 2. Contributors to the 1naximum match in the completed array. The alternative pathways thnt could form tho maximu~umatch am illustrated. The maximum match terminates at the largest number in the first row or first column, 8 in t h i s case. \. ? : 446 S . B . S E E D L E NA ASND C. D. W U N S C H It is apparent that the above array operation can begin a t any of a number of points along the borders of the array,which is equivalent to a comparisonof N-termi. nal residues or C-terminal residues only. As long as theappropriate rules for pathwsga are followed, the maximum match mill be the same. The cells of the array whi& contributed to the maximum match, may be determined by recording the origin of the number that mas added to each cell when the array was operated upon. 3. Evaluating the Significance of the Maximum Match A given maximum match may represent the maximum number of amino acids matched, or i t may just be a number that is a complex function of the relationship between sequences. It will, however, always be a function of both the amino acid compositions of the proteins and the relationship between their sequences. One may ask whether a particular result found differs significantly from a fortuitous match between two random sequences. Ideally,onewould prefer to h o w the exactprobability of obtaining the result found from a pair of random sequences and whatfraction of the total possibilities are less probable, but thatis prohibitively difficult, especially if a comples functionmere used for assigning a valueto the cells. As an alternative to determining the esact probabilities, i t is possible to estimate t.he probabilities experimentally. To accomplish the estimate one can constructtwo sets of random sequences, a set from the amino acid compositionof each of the proteins compared. Pairs of random sequences can then be formedby randomly h + g one member from each set. Determiningthe maximurn match for each pair selec&$ w l iyield 8 set of random values. If the value foundfor the real proteins is significant12 different from the values foundfor the random sequences,the difference is a function of the sequences aloneand not of t,he compositions.Alternatively, one can construct random sequences from only one of the proteins andcompare them with the otherto determine a set of random values. The twoprocedures measuredifferent probabilities. The first procedure determines whether a significant relationship exists between the real sequences. The second procedure determines whether the relationship of the protein used to form the random sequencesto the otherproteins is significant. It bears reiterating that the integral amino acid composition of each random sequence must be equal to that of t.he protein it represents. The amino acid sequence of each protein compared belongs to a set of sequences which are permutations. Sequences drawm randomly from one or both of these seta are used to establish a dist.ribution of random maximum-match vaIueswhich would include allpossible values if enough comparisonswere made. The null hypothesis, that any sequence relationship manifested by the two proteins is a random one, is tested. If the distribution of random values indicates a small probability that a maximum match equal to, or greater than, thatfound for the two proteinscould be drawnfrom the random set, the1lypot.hesis is rejected. 4. Cell Values and Weighting Factors To provide a theoretical framework for experiments, amino acid pairs may be classified into two broad types, identical and non-identical pairs. From 20 different amino acids one can construct IS0 possible non-identical pairs. Of these, 75 pairs of amino acids have codons(Marshall. &key h Nirenberg, 1967) whose bases differ at only one position (Eck & Dayhoff, 196G).Each change is presumably the result of a 1 ' SIMILARITIES I N AMINO ACID SEQUENCE 447 single-point mutation. The majority of non-identical pairs have a maximum of only one or zero corresponding bases. Due to the degeneracy of the genetic code, pair differences representing amino acids with no possible corresponding bases are uncommon even in randomly selected pairs. If cells are weighted in accordance with the maximum number of coxsponding bases in codons of the represented amino acids, the maximum match will be a function of identical and non-identical pairs. For comparisons in general, the cell weights can be chosen on any basis. If every possible sequence gap is allowed in forming the maximum match, the significance of the maximum match is enhanced by decreasing the weight of those pathways containing a large number of gaps. A simple way to accomplish this is to assign a penalty factor, a number which is subtracted from themaximum match for each gap used to form it. The penaltyis assigned beforethe maximum matchis formed. Thus the pathways will be weighted according to t.he number of gaps they contain, but the natureof the contributors to the maximum match will be affected as well. In proceeding from one cell to the next in a maximum-match pathway, it i-D necessary that thedifference betweeneach cell value and the penalty, be greater than t.he value for a cell in a pathway that contains no gap. If the ralueof the penaltywere zero, all possible gaps could be allowed. If the value were equal to the theoretical value for the maximum match between two proteins, i t would be impossibleto allow a gap and the maximum match rould be the largest of the values found by simply summing along the diagonals of the array; this is the simple frame-shift method. 5. Application of the Method To illustrate the role of weighting factors in evaluating a maximum match, two proteins expected t o shorn homology, whale myoglobin (Edmundson, 1965) and human &hemoglobin (Konigsberg, Goldstein & Hill, 1963), and two proteins not expected to exhibit homology, bovine pancreatic ribonuclease (Smyth,Stein 8: Uoore, 1963) and hen’s egg lysozyme (Canfield, 1963) were chosen for comparisons. The FORTRAX programs used in this studywere written for t.he CDC3400 computer. The operations employed in forming the maximum match are those for the special case when none of the cells of the array haTe a value less than zero. Four types of amino acid pairs were distinguished a.nd Tanable sets consisting of ralues to be assigned fo each type of pair and a value for the penalty were established. The pair types are as follo~vs: Type 3.Identical pairs: those having a masimum of three corresponding bases in their codons. Type 2. Pairs having a masimum of two corresponding basesin t,heir codons. Type I. Pairs having a maximum of one corresponding base in t,heir codons. Type 0. Pairs having no possible corresponding basein their codons. The value for type 3 pairs was 1-0and t.he value for type 0 pairs was zero for all variable sets. A t program execution time, the amino acids (coded by two-digit numbers) of the sequences to be compared were read into the computer, and were followed by a twenty-by-twenty symmetricalarray, themaximum correspondencearray, analogous e0 one used by Fitch (1966), that contained all possible pairs of amino acids and identified each pair as to type. The RN-4 codons for amino acids used to construct the maximum-correspondencearray were taken from a single Table (Marshall et al., I 448 S. B. NEEDLEMAN A N D C. D. W U N S C H 1967). The UGX, U-4-4 and UAG codonswere not used, but UUG was used a codon for leucinc. The subsequent data cards indicated the numerical values for a variable set. The two-dimensional comparison array was generated row-by-row. The amino acid code numbers for Ai and Bj refcrencctl the correspondence array to determine the type of amino acid pair constituted by Ai and Bj. The type number referencedshort a array, the variable set, containing the type values, and the appropriate value from' that set was assigned to the appropriate cell of the comparison array. Thc maximum match was then determined by the procedure of successire summations. Following thedetermination of themaximum matchfor the real proteins, the amino acid sequence of only onemember of the protein pair was randomized and thematch was repeated. The sequences of ,%hemoglobinand ribonucleasewere the ones random. iied. The randomization procedure was a sequence shuffling routine based on computer-generated random numbers. -4 cycle of sequence randomization-maximummatch determination was repeated ten times in all of the experiments in this report, giving the random ralues used for comparison with the real maximum-match. The average and standard deviat,ion for the random values of each variable set was estimated. ' 6. Results and Discussion The use of a small random sample size (ten)was necessary to hold the computer time to a reasonable level. The maximum probable error in a standard deviation estimate for a sample this small is quite large and the resultsshould be judged with this fact in mind. For each set of variables, i t was assumed that the random values would be distributedin the fashion of the normal-error curve;therefore, the vduesof the first six random sets in the,B-hemoglobin-myoglobin comparison were converted to standard measure, five was added to the result, and these values were plotted aa one group against their calculated probit. The resultsof the plot are shorn in Figure 3. The fit is good indicating the probable adequacy of the measured standard deviations for these variable sets in estimating distribution functions for random valuesthrough two standard derktions. The above fit indicates no bias in the randomization procedure. In other words, randomization of the sequence was complete before the maximum mat.chwas determined for any sequence in a random set. The resultsobtained inthe comparison of,B-hemoglobin withmyoglobin are summarized in Table1 and t.he results for the ribonuclease-lysozyme comparison are in Table 2. These Tables indicate t.he values assigned to the pair types, the penalty factor used in forming each of t.he maximum matches, and bhe statistical results obtained. The numbero f gaps roughly characterizes the natureof the pathway that formed the maximum ma.tch. A large number is indicative of a devious pathway through the array.One gap means t.hatall of the pa,thwag maybe found on only two partial diagonals of the array. The most important information is obtained from the standardized value of the masimum match for the real proteins, the difference from the mean in standarddeviation units. For this sample size all deviations greater than 3-0were assumed to include less than 17; of the true randompopulation and to indicate a significant difference. As might be expected, all matches of myoglobin and &hemoglobin show a significant deyiation. Among the sets of variables, set 1, which results in a search for identical amino acid pairs while allowingfor all deletions, indicates that 63 *: 7 .' ./* 1 / Vcriclb!e in stcndorllzed measure Flc. 3. Probit plot for six grouped random samples. The solid line indicates the plot that ~ w u l dresult from a probit analysis on an infinite number of sarnptes from a normdly-distributed population. The points represent the results of probit calculations on 60 random masimum match values that, were assumed to have come from one population. set 1 2 3 4 5 6 - Uatch values for pair t ? p s Penalty >Iasimun~-~~ratrh value sun1 Real 2 1 0' 0 0 0 1-00 63.00 35.00 0.33 0 0.33 0.05 1-03 97-00 S9.83 0.05 0.05 1.05 25 0 0.65 0-67 0.25 0-25 0-2s 0 71-55 61-95 47.30 S Rrd J1inimu;n deletions Ran 53.60 1-SO 9i.SO Yl-47 80.25 64-58 40.54 33-SO 2.09 1-33 4.11 4.SS 3.9; 1-11 1-59 S4B 4.27 1.16 :-SO 1.52 8.85 3; 4 36,s IS 24-3 3-8 1 46 3 0 -- d'i) -- 4.50 4 'i) 0 a is the estimated standnrddeviation: S. tho standardized value. (mal-randoom)\s. of t h o maximum match of the real proteins. The vnlues for type 3 and Q-pe 0 pairs werc 1-0and 0. respectivcly, in each variable set, t An average value from 10 samples. S. I!. N E E D L E M A N A S D C . D. WUNRCH TABLE2 Match values for pair type3 - 1 Penalty Maximum-match vnlue sum * Red x Real Randomi 1 Minimum deletions Real Rsndornt 34 29-2 5.2 ....._ 0 0 0.67 0.67 0.25 0.25 0.25 0 0 0.33 0.33 0.05 0.05 0.05 0 1.00 0 1.03 0 1.0: 25 48.00 23.00 58-33 67.93 56.00 33-50 28.15 44.20 22.00 56.15 65.37 32.26 33.02 27-67 2.56 1-73 0.82 1.27 2-12 1.66 1.75 1-48 0.58 2.64 0-43 1.77 0-41 0.22 5 21 18.8 2 35 2.2 35-5 8 0 6.8 0 8 is the estimated standard deviation; X, the standardizedvalue, (red-random)/s. of the maximum match of the real proteins. The values for type 3 and type 0 pairs were 1-0 and 0, respectively in each variable set. t An average value from 10 samples. amino acids in p-hemoglobin and myoglobin can be matched. To attain this match, however, it is necessary to permit at least 35 gaps. In contrast, when two gaps are allowed according to Braunitzer (19G5), i t is possible to match only 37 of the amino acids. Curiously, 11-hen this nriable set was used for comparing hwn.an myoglobin (Hill, personalcommunication)withhuman j3-hemoglobin, the maximummatch obtained was not significant. Differences between real and random values were highly significant, however, when ot.her variable sets were used. Variable set2 attaches a penalty equalto the ralueof one identica1 amino acid pair to the search for identical amino acid pairs. This penaltywill exclude fromconsideration any possible pat,hway that leayes and returns to a principal diagonal, thereby needing twogaps, in order to addonly oneor two amino acids to themaximum match. This set results in a total of 30 4 = 42 amino acids matched (the maximum-match value plus the number of gaps is reduced to four) and the significance of the result relative to set 1 appears tobe increased. Braunit.zer’s comparison would have a value of 35 - 2 = 35 using this variable set, hence i t 11-3snot selected by the method. Variable sets 3 and 4 have an interesting property. Their maximum-match ralues can be related to the minimum number of mutat.ions needed to convert the selected parts of one amino acid sequence into the selected parts of the other. Theminimum number of mutations concept in protein comparisons was first advanced by Fitch (1966). If the t.vpe ralues for these sets are multiplied by three, they become equal to their pair type and directly represent the maximum numberof corresponding bases in the codons for a giren amino acid pair. Thus t,he masimum match and penalty factors may be multiplied by three, making i t possible to calculate the masimum number of bases matched in t.he combination of amino acid pairs selected by the marimum-mat.ch operat.io1-h. fl-Hemoglobin,t.he smaller of the two proteins, contains 146 amino acids; consequentlythehighesi possible mssimum mat.ch (disregarding integralamino-acid composition data) with myoglobin is 146 X 3 = 435. Insufficient data are al-aiiabh + S I M I L A R I T I E S IK AMINO A C I D S E Q U E N C E 451 to analyze the result from set 3 on the basis of mutations. If i t is assumed that the gap in set 4 does not exclude any partof /3-hemoglobin fromthe comparison, this sethas a maximum of 3(89-63 1.03) = 272 bases matched, indicating a minimum of 43s 272 = 166 point mutations in this combination. Using this variable set and placing + ' . - gaps according to Braunitzer, a score of 88.6 was obtained, thus his match was not selected. Again it may be obsert-ed that thepenalt? greatly enhanced thesignificance of the maximum match. Variable sets 5 and 6 have no intrinsic meaning and were chosenbecause the weight attached to type 2 and type1 pairs is intermediate in value with respectto sets 1and 2 and sets 3 and 4. The maximum match for set 6 is seen to hare a highly significant value. The data of set 7 are results that would be obtained from using the frame-shift method to select a masimum match; the penalty was large enough to prevent any gaps in the comparisons. The slightdifferences in significance found amongthe maximum-match valuesof 8-hemoglobinand myoglobin resulting from use of sets 4,6 and 7 are probably meaningless due to small sample size and errors introduced by the assumptions about the distribution functions of random values. Finding a value in set 7 that is approximately equalto those from sets 4 and 6 in significance is notsurprising. A larger penalty factor would hare increased the difference from the mean in sets 4 and 6 because almost every randomralue in each set was the resultof more gapsthan were required to form the real maximum match.Further, the gaps that allowed are are at the E-terminalends so that about 85% of the comparison can be made without gaps. If an actual gapwere present near the middle of one of the sequences, i t would have caused a sharp reduction in thesignificance of thc frame-shift t.ype of match. Set 3 is the only variable set in Table2 that shows a possible difference. Assuming the value is accurate, other than chance, there is no simple explanation for the difference. A small but meaningful differencein any comparisoncouldrepresent evolutionary divergence or convergence. It is generally accepted that the primary structure of proteins is the chief determinant of the tertiary structure. Because certain features of tertiary structure are common to proteins., it is reasonable to suppose that proteins will exhibit similarities in their sequences,and thatthese similarities w l ibe sufficient to cause a si,anificant difference between most protein pairs and their corresponding randomized sequences, being an example of submolecular evolutionary convergence. Further, the interactions of the protein backbone, side chains, and thesolvent that determine tertiary structure are,in large measure, forces 8-g from the polarity and steric nature of the protein side-chains. There are conspicuous correlations in t.he polarity and steric nature of type 2-pairs. Heavy weighting of thcse pairs would be expected to enhance the significance of real rnaximum-match values if common structural feat.ures are present in proteins that are compared. The presence of sequence similarities does not always imply comnlon ancestrfr in proteins. More experimentation \Idill be required before a choice among the possibilities suggested for the result from set 3 can be made. If sereral short sequences of amino acids are common to a11 proteins, it seems remarkable that the relationship of ribonuclease to lysozyme in six of the serennriable sets appears to be truly a random one. It shouId be noted, howerer,that thest.andard d u e of t . 1 real ~ maximum-match is positive in each variable set in this comparison. This method was designed for t.he purpose of detect.ing homology and defining its nature when it can be shown to exist.. Its usefuluess for the above purposes depends in '. ' 1 452 S. 13. SISEDLEI\f.+S A S D C. D. W U N S C H part upon assumptions rclatcd to the genetic events that could have occurred in thd\ 3 evolution of proteins. Starting with thc assumption that homologous proteins a= tb? rcsult. c?f gene duplicationandsubsequcntmutations, i t is possible to Comtruet’ several hypothetical amino-acidscquences that would be expected to show homologp. Tf onc assumcs that follou-ing the duplication, point mutations occur at a constant or variable rate, but randomly, along t.he genesof the two proteins, after a relatively 1 short period of time the protein pairs will have nearly identical sequences. Detection of the high degree of homology present can be accomplished by several means. The use of values for non-identical pairs will do little to improve the significance of the results. If no, or very few, deletions (insertions) have occurred, one could expect to enhance thesignificance of the match by assigning a relatively high penaltyfor gap. : Later on in time the hypothetical proteins may have asizable fraction of their codom,; changed by point mutations,the result being that an attemptto increase the si-. . came of the maximum match will probably require attaching substantial weight to those pairs representing amino acids still having two of the three original bases in their codons. Further, if a few moregaps hare OCCUR^^, the penaltyshould be reduoed to a small enough valueto allow areas of homology ta be linked to one another. At a still Iater date in time more emphasis must be placed on non-identical pairs, and perhaps a very small or even negative penalty factor must be assessed. Eventually, it wiLl be impossible to detect the remaining homology in the h-vpothetical exampIe by using the approach detailed here. From considerationof this simple model ofprotein evolution one maydeduce that the variables which maximizethe significance ofthe difference betueen real andrandom proteins gives an indication of the nature of the homology. In the comparison of human 8-hemoglobin to whale myoglobin, the assignment of some weight to type 2 pairsconsiderablyenhances the significance of the result, indicating substantia1 erolutionary divergence. Further, few deletions (additions) have apparentlyoccurred. It is known that theevolutionary di\-ergencemanifested by cytochrome (Margoliash, Needleman & Stewart, 19G3)and other heme proteins (Zuckerkandl8z Pauling, 1965) did not follow the sample model outlined above. Their dirergence is the result of tmr-random mutations along the genes. The degree and type of homology can be expected t o differ between protein pairs. A s a consequence of the difference there is no u priori best set of cell and operation values for masimizing the significance of a masimum-mat,ch valueof llomologous proteins, and as a corollary to this fact, there is no best set. of Talus for the purpose of detecting only slighthomology. This is an important consideration, because whether the sequence relationship between proteins is significant depends solely upon the cell and operation values chosen. If it is found that the divergence of proteins follows one or two simple models, it may be possible to derive a set of values that will be most useful in detecting and d e h i n g homology. The most. common met hoc1 for determining the degree of homology between protein pairs has bcen to count thc number of non-identical pairs (amino acidreplacements) in the homologous co~nparison andto use this number as a measure of evolutionary distance between the amino acid sequences. -1second, more recentconcept has been to count the minimurn number of mut.ations represented by the non-identical pairs. This number is probably 3 more adequate measureof evolutionary distancebecause it utilizes more of t . 1 ~ arailablc infonuation and theory to give some measure of the number of genetic events that have occurred in the erolution of the proteins. The approach outlined in this paper supplies either of these numbers. i: SIMILARITIES I K A J I I S O A C I D SEQUENCE 463 This work wm supported in part by grants t o one of us (S.B.X.) from tho U.S. Public Health Sorvico (1 501 FR 05370 02) and from JIcrck Sharp C Dohme. REFEREXES Braunitzer, G. (1965). In Evolving Genes and Proteina, ed. by . '1 Bryson & H. J. Vogel, . p. 183. New York: Academic Prcss. Canfield, R. (1963). J . Biol. Chcm. 238, 2G9Y. Eck, R. V. & DayhoE', 31. 0.(19GG). Atlas of Protein Sequence and Structrrre. Silver Spring. Maryland: National Biomedical Research Foundation. Edrnmdson, A. B. (19155).Nature, 205, 883. Fitch, W. (1966). J . 11101.EWZ. 16, 9. Konigsberg, W.,Goldstein, J. & Hill, R. J. (1963). J . Biol. Chem. 238, 2025. MargoIiash, E.,Needleman, S. B. C Stewart., .J. iY. (19G3).A c t a Clrem. S c a d . 17, S 250. Marshall, R. E., Caskey, C. T. 8: Sircnbcrg. 31. (19Gi). Science, 155, 820. X d l e m s n , S. B. & Blair. T.H.(1969). Proc. S a t . d c a d . Sci., Wash. 63, 1127. Smyth, D. G., Stein, W . G. 8: Moore, S. (1963). J. B i d Chem. 238. 227. Zuckerhdl, E. 8: Pauling, L. (1965). I n EcoZci~yCejles and Proteins, cd. by V. Bryson & E.J. Vogel, p. 95. Xew York: Academic Press.