Copyright 2006 by the Genetics Society of America
DOI: 10.1534/genetics.105.049643
Estimating the "Effective Number of Codons": The Wright Way of Determining Codon Homozygosity Leads to Superior Estimates
Anders Fuglsang1
Danish University of Pharmaceutical Sciences, Copenhagen DK2100, Denmark and Norwegian Medicines Agency, N0950 Oslo, Norway.
Manuscript received August 16, 2005
Accepted for publication October 12, 2005
ABSTRACT
In 1990, Frank Wright introduced a method for measuring synonymous codon usage bias in a gene by
estimation of the ‘‘effective number of codons,’’ Nc. Several attempts have been made recently to improve
Wright’s estimate of Nc, but the methods that work in cases where a gene encodes a protein not containing
all amino acids with degenerate codons have not been tested against each other. In this article I derive five
new estimators of Nc and test them together with the two published estimators, using resampling under
rigorous testing conditions. Estimation of codon homozygosity, F, turns out to be a key to the estimation
of Nc. F can be estimated in two closely related ways, corresponding to sampling with or without replacement, the latter being what Wright used. The Nc methods that are based on sampling without replacement
showed much better accuracy at short gene lengths than those based on sampling with replacement,
indicating that Wright’s homozygosity method is superior. Surprisingly, the methods based on sampling
with replacement displayed a superior correlation with mRNA levels in Escherichia coli.
1 Address for correspondence: Danish University of Pharmaceutical Sciences, 2 Universitetsparken, Copenhagen DK2100, Denmark.
E-mail: [email protected]
Genetics 172: 1301–1307 (February 2006)

Nonrandom usage of synonymous codons is a phenomenon that has been studied in a wide range of life forms. Often, the biased usage of synonymous codons is caused by translational selection; i.e., highly expressed genes tend to have a set of codons corresponding to the more abundant tRNA species (Ikemura 1981, 1985; Gouy and Gautier 1982). The study of this phenomenon relies on quantification of the codon usage bias, and for this purpose several one-dimensional statistics have been proposed (reviewed by Comeron and Aguade 1998). In 1990, Frank Wright introduced an intuitive measure of codon usage bias, termed the "effective number of codons" ($\hat{N}_c$) used in a gene. The idea is simple: one assigns to a gene a number between 20 and 61 that tells to what degree the entire genetic code is used. A value of 20 indicates that just one codon is used for each amino acid (extreme bias), while a value of 61 indicates that all codons are used equally (no bias). To calculate $\hat{N}_c$ for a gene, one needs to have knowledge about the codon "homozygosity" ($\hat{F}_{aa}$, explained in the appendix) for individual amino acids. For individual amino acids $\hat{N}_{c_{aa}}$ is given by $\hat{F}_{aa}^{-1}$. Wright's formula was

$$\hat{N}_c = 2 + \frac{9}{\hat{F}_2} + \frac{1}{\hat{F}_3} + \frac{5}{\hat{F}_4} + \frac{3}{\hat{F}_6}, \qquad (1)$$

where $\hat{F}_i$ denotes the average homozygosity for the class with $i$ synonymous codons. The coefficients 9, 1, 5, and 3 come from the number of amino acids belonging to
the different classes. Since exactly one codon is used by methionine and tryptophan, they have by definition always one effective codon each, so two are added without further calculation. Note that there is averaging involved in this formula; some genes encode proteins in which not all amino acids are present (or present in such low counts that calculation is problematic, see later), and in such cases the $\hat{N}_{c_{aa}}$ needs to be appropriately approximated. In Wright's original method, the problem of missing amino acids was solved by assuming that the $\hat{F}_{aa}$-value of the missing amino acid is equal to the mean of the $\hat{F}_{aa}$-values of the other amino acids belonging to the same degeneracy class. For example, if in a gene we have alanine and glycine codons missing, it is assumed that their $\hat{F}_{aa}$-values are equal to the mean of the $\hat{F}_{aa}$-values for proline, threonine, and valine, giving an $\hat{F}_4$ that is based on three values. Recently, it was shown that in Escherichia coli there is a poor correlation between the $\hat{F}_{aa}$-values within a degeneracy class, so using the average of the $\hat{F}_{aa}$-values generally gives a rather poor estimate (Fuglsang 2004). In the same work it was illustrated how this type of averaging may lead to a systematic underestimation of $\hat{N}_c$; this happens when there is "bias discrepancy," which was qualitatively defined as the phenomenon of observing a strong bias for one amino acid while observing weak bias for another amino acid having the same degeneracy. That work ended with a proposal to add the individual $\hat{N}_{c_{aa}}$-values, yielding an estimator called $\hat{N}_c^{*}$, and a simulation study suggested this to be a safe choice in long genes. Marashi and Najafabadi (2004) pointed out a potential weakness in the methodology since there is a chance that the individual $\hat{N}_{c_{aa}}$-values exceed the degeneracy. This has to do with the way $\hat{F}_{aa}$ is calculated. I here refer the reader to the appendix, in which I have described the two methods for $\hat{F}_{aa}$ calculation. Wright's way of calculating $\hat{F}_{aa}$ (A2) can be viewed as involving sampling without replacement, while an alternative formula (A1) can be viewed as involving sampling with replacement. The latter was then proposed to be useful in an estimator called $\hat{N}_c^{**}$ (Fuglsang 2005), which also was based on addition of all individual $\hat{N}_{c_{aa}}$-values. Although this estimator would converge toward the "true" Nc value with increasing gene lengths, the problem of missing amino acids was not solved, as Banerjee et al. (2005) pointed out. They therefore suggested an estimator that combined Wright's way of averaging within degeneracy classes with the $\hat{F}_{aa}$-calculation based on sampling with replacement, but they did not test their idea in a controlled experiment.
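Equation 1 and Wright's within-class averaging can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Wright's own code: the data layout (three-letter amino acid code mapped to a list of synonymous codon counts), the function names, and the isoleucine fallback (the mean of $\hat{F}_2$ and $\hat{F}_4$ when class 3 cannot be computed) are assumptions made for the example. Homozygosity is computed without replacement, as described in the appendix.

```python
# Degeneracy of the 18 amino acids with synonymous codons (standard genetic code).
DEGENERACY = {
    "Asn": 2, "Asp": 2, "Cys": 2, "Gln": 2, "Glu": 2,
    "His": 2, "Lys": 2, "Phe": 2, "Tyr": 2,
    "Ile": 3,
    "Ala": 4, "Gly": 4, "Pro": 4, "Thr": 4, "Val": 4,
    "Arg": 6, "Leu": 6, "Ser": 6,
}


def homozygosity_a2(counts):
    """Equation A2 (sampling without replacement); None if not computable."""
    n = sum(counts)
    same_pairs = sum(c * (c - 1) for c in counts)
    if n < 2 or same_pairs == 0:
        return None
    return same_pairs / (n * (n - 1))


def nc_wright(gene):
    """Equation 1. `gene` maps amino acid -> list of synonymous codon counts."""
    mean_f = {}
    for deg in (2, 4, 6, 3):  # class 3 last so its fallback can use classes 2 and 4
        values = []
        for aa, d in DEGENERACY.items():
            if d == deg:
                f = homozygosity_a2(gene.get(aa, []))
                if f is not None:
                    values.append(f)
        if values:
            mean_f[deg] = sum(values) / len(values)
        elif deg == 3:
            # Assumed fallback for a missing isoleucine (not stated in this article).
            mean_f[3] = (mean_f[2] + mean_f[4]) / 2
        else:
            raise ValueError(f"no F value available for degeneracy class {deg}")
    return 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]


# Hypothetical gene using a single codon per amino acid, with Ala and Gly absent.
gene = {aa: [40] + [0] * (deg - 1) for aa, deg in DEGENERACY.items()}
del gene["Ala"], gene["Gly"]
print(nc_wright(gene))  # extreme bias, so the estimate is 20
```

The missing alanine and glycine simply drop out of the class-4 average, which is then based on three values, as in the example in the text.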
As mentioned earlier, averaging is associated with a systematic error if there is bias discrepancy, and since this is apparently often the case (see Fuglsang 2005), I believe there is a need for an estimator that is resistant to bias discrepancy, preferably one that is also capable of handling missing amino acids.
The purpose of this article is to suggest such estimators. I introduce some new ideas about how to handle the problem of missing amino acids, giving rise to five new estimators. The best way to test the behavior of such bias estimates is to test them under conditions that can be fully controlled. This can be achieved by simulation/resampling (Wright 1990; Comeron and Aguade 1998; Fuglsang 2004, 2005). Here, all seven methods that allow estimation of Nc when amino acids are missing (or present in too low an amount to allow calculation of $\hat{F}_{aa}$) are tested by resampling. Finally, in a recent article a surprisingly good correlation was observed between $\hat{N}_c^{**}$ and mRNA levels in E. coli. The new estimates are tested in a similar fashion for their correlation with mRNA levels in E. coli. It should be noted here that $\hat{N}_c^{*}$ and $\hat{N}_c^{**}$ will not receive any focus in this article, since these methods require that no amino acids are missing.
To avoid further confusion and word clutter, all estimators tested in this work are given unique subscripts, and Wright's estimator in the following is referred to as $\hat{N}_{cW}$ while the method of Banerjee et al. (2005) is referred to as $\hat{N}_{cB}$.
MATERIALS AND METHODS
Derivation of an $\hat{N}_c$ based on degeneracy usage: We might want to incorporate as much knowledge about codon bias as possible in estimates for codon bias for missing amino acids. Let us consider the degree by which the codons for a given amino acid represent full usage of all synonymous alternatives. In the following this is referred to as "degeneracy usage" (DU). If only one codon is used then DU is 0.0 (zero percent, synonymous alternatives are not used at all), but if all degenerate codons are used equally DU will be 1.0 (100% usage of the synonymous alternatives).
The general formula for estimating DU is

$$\widehat{DU}_{aa} = \frac{\hat{N}_{c_{aa}} - 1}{\mathrm{Deg}_{aa} - 1}, \qquad (2)$$

where $\mathrm{Deg}_{aa}$ is the degeneracy. Note the caret symbol: $\widehat{DU}_{aa}$ is an estimate. A $\widehat{DU}_{aa}$-value can be obtained for all amino acids for which $\hat{F}_{aa}$ and thereby $\hat{N}_{c_{aa}}$ can be calculated. The average of the $\widehat{DU}_{aa}$-values ($\widehat{DU}$) can then be used to estimate an $\hat{N}_{c_{aa}}$ value for each missing amino acid, by rearranging Equation 2. Thus, for all amino acids other than methionine and tryptophan (for which we by definition set $N_{c_{aa}} = 1$), we have

$$\hat{N}_{c_{aa}} = \begin{cases} \dfrac{1}{\hat{F}_{aa}} & \text{if possible} \\ \widehat{DU}\,(\mathrm{Deg}_{aa} - 1) + 1 & \text{else.} \end{cases} \qquad (3)$$

From this we will always be able to summarize an $\hat{N}_c$:

$$\hat{N}_{cF_4} = \hat{N}_{c_{ala}} + \hat{N}_{c_{arg}} + \hat{N}_{c_{asp}} + \cdots + \hat{N}_{c_{val}}. \qquad (4)$$

In other words, the $\hat{N}_{c_{aa}}$-value for missing amino acids is approximated through knowledge of the average codon bias. In principle this resembles the original method by Wright, but in contrast to his method the knowledge about bias for as many amino acids as possible is included in each estimate. As an advantage we can in principle calculate the estimate for genes as short as one codon (apart from the start and stop codons, provided that the codon encodes a degenerate amino acid). Note that Equation 2 has two different implementations, depending on the method used to measure $\hat{F}_{aa}$. In the following, when Equation A1 is used to estimate the homozygosity, the estimate is referred to as $\hat{N}_{cF_4'}$, and when Equation A2 is used the estimate is referred to as $\hat{N}_{cF_4}$.
Derivation of an $\hat{N}_c$ based on relative homozygosity: Quite similarly to the method listed above we can define a relative homozygosity estimate; we note that, in theory, $\hat{F}_{aa}$ is a value between $1/\mathrm{Deg}_{aa}$ and 1.0. We can normalize this by

$$\hat{F}_{rel,aa} = \frac{\hat{F}_{aa} - (1/\mathrm{Deg}_{aa})}{1 - (1/\mathrm{Deg}_{aa})}, \qquad (5)$$

which has a value between 0 and 1 across all amino acids in all degeneracy classes. We can use this normalization to obtain an average relative homozygosity estimate $\hat{F}_{rel}$. On the basis of this average we get

$$\hat{N}_{c_{aa}} = \begin{cases} \dfrac{1}{\hat{F}_{aa}} & \text{if possible} \\ \dfrac{1}{\hat{F}_{rel}\,(1 - (1/\mathrm{Deg}_{aa})) + (1/\mathrm{Deg}_{aa})} & \text{else,} \end{cases} \qquad (6)$$

and this also allows us to make an $\hat{N}_c$ by summation. In the following, when Equation A1 is used to calculate the homozygosity, the estimate is referred to as $\hat{N}_{cF_5'}$, and when Equation A2 is used the estimate is referred to as $\hat{N}_{cF_5}$.
Derivation of an $\hat{N}_c$ based on error-correction of $\hat{N}_{cW}$: As shown previously (Fuglsang 2004) there is a methodological weakness built into $\hat{N}_{cW}$ because it averages the $\hat{F}_{aa}$-values in degeneracy classes. This adds some systematic error, $e$, in the estimator, and we might be capable of estimating this by using a little trick. Formally we can write

$$\hat{N}_{cW} = \text{"true" } N_c + e,$$

where $e$ is a function of the gene.
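The two ways of filling in missing amino acids (Equations 2 and 3 for degeneracy usage, Equations 5 and 6 for relative homozygosity) can be sketched as one summation routine. The sketch below is illustrative only: the function names, the `fallback` keyword, and the dictionary layout (amino acid mapped to a list of synonymous codon counts) are assumptions, not part of the original method description. Passing the A1 or A2 homozygosity function selects the primed or unprimed variant.

```python
DEGENERACY = {
    "Asn": 2, "Asp": 2, "Cys": 2, "Gln": 2, "Glu": 2,
    "His": 2, "Lys": 2, "Phe": 2, "Tyr": 2,
    "Ile": 3,
    "Ala": 4, "Gly": 4, "Pro": 4, "Thr": 4, "Val": 4,
    "Arg": 6, "Leu": 6, "Ser": 6,
}


def f_a1(counts):
    """Equation A1 (with replacement); None if the amino acid is absent."""
    n = sum(counts)
    if n == 0:
        return None
    return sum((c / n) ** 2 for c in counts)


def f_a2(counts):
    """Equation A2 (without replacement); None if not computable."""
    n = sum(counts)
    same_pairs = sum(c * (c - 1) for c in counts)
    if n < 2 or same_pairs == 0:
        return None
    return same_pairs / (n * (n - 1))


def nc_by_summation(gene, f_func, fallback):
    """Sum the per-amino-acid values; gaps are filled via Eq. 3 ("du") or Eq. 6 ("rel")."""
    observed = {}
    for aa in DEGENERACY:
        f = f_func(gene.get(aa, []))
        if f is not None:
            observed[aa] = f
    if not observed:
        raise ValueError("no homozygosity value could be calculated")
    if fallback == "du":
        # Equation 2, averaged over amino acids with a computable F.
        mean = sum((1 / f - 1) / (DEGENERACY[aa] - 1)
                   for aa, f in observed.items()) / len(observed)
        fill = lambda deg: mean * (deg - 1) + 1                   # Equation 3
    elif fallback == "rel":
        # Equation 5, averaged over amino acids with a computable F.
        mean = sum((f - 1 / DEGENERACY[aa]) / (1 - 1 / DEGENERACY[aa])
                   for aa, f in observed.items()) / len(observed)
        fill = lambda deg: 1 / (mean * (1 - 1 / deg) + 1 / deg)   # Equation 6
    else:
        raise ValueError("fallback must be 'du' or 'rel'")
    total = 2.0  # Met and Trp contribute one effective codon each
    for aa, deg in DEGENERACY.items():
        total += 1 / observed[aa] if aa in observed else fill(deg)
    return total


# Hypothetical gene with equal codon usage and alanine missing.
gene = {aa: [10] * deg for aa, deg in DEGENERACY.items() if aa != "Ala"}
print(nc_by_summation(gene, f_a1, "du"))   # F4' variant
print(nc_by_summation(gene, f_a2, "du"))   # F4 variant
print(nc_by_summation(gene, f_a1, "rel"))  # F5' variant
print(nc_by_summation(gene, f_a2, "rel"))  # F5 variant
```

With Equation A2 an observed $\hat{F}_{aa}$ can fall below $1/\mathrm{Deg}_{aa}$, so the averages used for filling are not guaranteed to stay inside the 0 to 1 range in short genes; the sketch does not guard against that.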
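TABLE 1
The different types of $\hat{N}_c$ and their calculation

| Estimate of $\hat{N}_c$, formula | $\hat{F}$-formula | Reference | Comment |
|---|---|---|---|
| $\hat{N}_{cW} = 2 + \frac{9}{\hat{F}_2} + \frac{1}{\hat{F}_3} + \frac{5}{\hat{F}_4} + \frac{3}{\hat{F}_6}$ | $\hat{F}_{aa} = \frac{\sum n_i \sum p_i^2 - 1}{\sum n_i - 1}$ | Wright (1990) | Requires that at least one $\hat{F}$-value can be calculated for degeneracy classes 2, 4, and 6. |
| $\hat{N}_c^{*} = \hat{N}_{c_{ala}} + \hat{N}_{c_{arg}} + \hat{N}_{c_{asp}} + \cdots + \hat{N}_{c_{val}}$ | $\hat{F}_{aa} = \frac{\sum n_i \sum p_i^2 - 1}{\sum n_i - 1}$ | Fuglsang (2004) | Requires that the $\hat{F}$-value can be calculated for all amino acids. |
| $\hat{N}_c^{**} = \hat{N}_{c_{ala}} + \hat{N}_{c_{arg}} + \hat{N}_{c_{asp}} + \cdots + \hat{N}_{c_{val}}$ | $\hat{F}_{aa} = \sum p_i^2$ | Fuglsang (2005) | Requires that the $\hat{F}$-value can be calculated for all amino acids. |
| $\hat{N}_{cB} = 2 + \frac{9}{\hat{F}_2} + \frac{1}{\hat{F}_3} + \frac{5}{\hat{F}_4} + \frac{3}{\hat{F}_6}$ | $\hat{F}_{aa} = \sum p_i^2$ | Banerjee et al. (2005) | Requires that at least one $\hat{F}$-value can be calculated for degeneracy classes 2, 4, and 6. |
| $\hat{N}_{cF_4} = \hat{N}_{c_{ala}} + \hat{N}_{c_{arg}} + \hat{N}_{c_{asp}} + \cdots + \hat{N}_{c_{val}}$ | $\hat{F}_{aa} = \frac{\sum n_i \sum p_i^2 - 1}{\sum n_i - 1}$ | This article | Can always be calculated. |
| $\hat{N}_{cF_4'} = \hat{N}_{c_{ala}} + \hat{N}_{c_{arg}} + \hat{N}_{c_{asp}} + \cdots + \hat{N}_{c_{val}}$ | $\hat{F}_{aa} = \sum p_i^2$ | This article | Can always be calculated. |
| $\hat{N}_{cF_5} = \hat{N}_{c_{ala}} + \hat{N}_{c_{arg}} + \hat{N}_{c_{asp}} + \cdots + \hat{N}_{c_{val}}$ | $\hat{F}_{aa} = \frac{\sum n_i \sum p_i^2 - 1}{\sum n_i - 1}$ | This article | Can always be calculated. |
| $\hat{N}_{cF_5'} = \hat{N}_{c_{ala}} + \hat{N}_{c_{arg}} + \hat{N}_{c_{asp}} + \cdots + \hat{N}_{c_{val}}$ | $\hat{F}_{aa} = \sum p_i^2$ | This article | Can always be calculated. |
| $\hat{N}_{ce} = 2 + \frac{9}{\hat{F}_2} + \frac{1}{\hat{F}_3} + \frac{5}{\hat{F}_4} + \frac{3}{\hat{F}_6} - \hat{e}$ | $\hat{F}_{aa} = \frac{\sum n_i \sum p_i^2 - 1}{\sum n_i - 1}$ | This article | Requires that at least one $\hat{F}$-value can be calculated for degeneracy classes 2, 4, and 6. |

$\hat{N}_c^{*}$ and $\hat{N}_c^{**}$ are outside the scope of this article.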
In the previous report (Fuglsang 2004) an example was given that showed when and how $\hat{N}_{cW}$ would be exactly 4.5 off for lengths going toward infinity. As is demonstrated in the following section, $\hat{N}_{cF_4}$ (and others) converges toward the true Nc with increasing length. When we have a gene in which we estimate $\hat{N}_c$ using whichever method, we could assume that the true codon probabilities correspond to those actually observed in the gene at hand and calculate $\hat{N}_{cW}$ and $\hat{N}_{cF_4}$ at infinite length with the actual codon probabilities. The difference is an estimate of the error that is caused by the averaging for $\hat{N}_{cW}$:

$$\hat{e} = \hat{N}_{cW,\infty} - \hat{N}_{cF_4,\infty}. \qquad (7)$$

Here the $\infty$ subscript has been added to indicate the infinite length. In practice, it is easily approximated just by multiplying all codon counts with a large integer (e.g., 1,000,000, which typically puts the uncertainty on the fifth decimal) and calculating $\hat{N}_{cW}$ on the resulting counts. Note that at infinite length we expect that (A2) and (A1) reach the same value, as do $\hat{N}_{cW}$ and $\hat{N}_{cB}$. Also $\hat{N}_{cF_4'}$, $\hat{N}_{cF_4}$, $\hat{N}_{cF_5'}$, and $\hat{N}_{cF_5}$ approach similar values, so Equation 7 can be written in many ways. Note also that $\hat{N}_{cF_4,\infty}$ equals $\hat{N}_{cF_4'}$. The estimator calculated as $\hat{N}_{cW} - \hat{e}$ is referred to as $\hat{N}_{ce}$ in the following.
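The count-scaling trick for Equation 7 can be sketched as follows. This is a minimal illustration under assumptions: the function names, the dictionary layout (amino acid mapped to a list of synonymous codon counts), and the isoleucine fallback inside `nc_w` are choices made for the example, and the example gene is hypothetical. With Equation 7 written as above, the sketch yields a negative value whenever class averaging pulls the Wright-type estimate below the summed estimate.

```python
DEGENERACY = {
    "Asn": 2, "Asp": 2, "Cys": 2, "Gln": 2, "Glu": 2,
    "His": 2, "Lys": 2, "Phe": 2, "Tyr": 2,
    "Ile": 3,
    "Ala": 4, "Gly": 4, "Pro": 4, "Thr": 4, "Val": 4,
    "Arg": 6, "Leu": 6, "Ser": 6,
}


def f_a2(counts):
    """Equation A2 (without replacement); None if not computable."""
    n = sum(counts)
    same_pairs = sum(c * (c - 1) for c in counts)
    if n < 2 or same_pairs == 0:
        return None
    return same_pairs / (n * (n - 1))


def nc_w(gene):
    """Equation 1 with within-class averaging of F."""
    mean_f = {}
    for deg in (2, 4, 6, 3):
        values = [f for aa, d in DEGENERACY.items() if d == deg
                  for f in [f_a2(gene.get(aa, []))] if f is not None]
        if values:
            mean_f[deg] = sum(values) / len(values)
        elif deg == 3:
            mean_f[3] = (mean_f[2] + mean_f[4]) / 2  # assumed fallback
        else:
            raise ValueError(f"no F value for degeneracy class {deg}")
    return 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]


def nc_f4(gene):
    """Equations 2-4: summation with the degeneracy-usage fallback, using A2."""
    observed = {aa: f for aa in DEGENERACY
                for f in [f_a2(gene.get(aa, []))] if f is not None}
    mean_du = sum((1 / f - 1) / (DEGENERACY[aa] - 1)
                  for aa, f in observed.items()) / len(observed)
    return 2 + sum(1 / observed[aa] if aa in observed else mean_du * (deg - 1) + 1
                   for aa, deg in DEGENERACY.items())


def e_hat(gene, factor=1_000_000):
    """Equation 7, approximating infinite length by scaling all codon counts."""
    scaled = {aa: [c * factor for c in counts] for aa, counts in gene.items()}
    return nc_w(scaled) - nc_f4(scaled)


# Hypothetical gene: equal codon usage everywhere except a fully biased Asn.
gene = {aa: [10] * deg for aa, deg in DEGENERACY.items()}
gene["Asn"] = [20, 0]
correction = e_hat(gene)
print(correction)               # close to -0.8 for this gene
print(nc_w(gene) - correction)  # the error-corrected estimate
```

In this example the only bias discrepancy sits in degeneracy class 2, so the whole correction comes from the difference between 9 divided by the class mean and the sum of the nine individual reciprocals.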
Simulation: To test accuracy and precision of $\hat{N}_{cW}$, $\hat{N}_{cB}$, $\hat{N}_{cF_4'}$, $\hat{N}_{cF_4}$, $\hat{N}_{cF_5'}$, $\hat{N}_{cF_5}$, and $\hat{N}_{ce}$ (all summarized in Table 1) simulation was used. Genes were initially simulated using the same amino acid composition as Wright originally used and with various sets of codon frequencies, corresponding to no bias (true Nc = 61.0), weak bias (true Nc = 53.0), medium bias (true Nc = 40.5), and strong bias (true Nc = 29.0). Furthermore, for the medium bias (true Nc = 40.5) two different sets of codon frequencies were used: one corresponding to no bias discrepancy and one corresponding to bias discrepancy as explained in a previous article (Fuglsang 2004). The simulated gene lengths (measured in codons) were: 50, 60, 70, 80, 90, 100, 110, 120, 130, 150, 200, 300, 500, 900, 1250, and 2000. This range of values was chosen on the basis of the distribution of lengths in E. coli (see Figure 1), but since the chance that amino acids are missing increases with shorter gene lengths, it was decided to go as low as 50 codons in length. Each combination of bias and gene length was resampled 10,000 times as in the article by Comeron and Aguade (1998), and averages and standard deviations of the estimates were recorded. Finally, as is explained further in results and discussion and as shown previously (Fuglsang 2004), the choice of codon frequencies and possibly also amino acid usage employed in the simulation experiment will heavily influence the accuracy of the estimator. For that reason, the simulations were also carried out with sets of codon and amino acid frequencies actually observed in selected genes from the genome of E. coli K12 (GenBank accession no. NC_000913).
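The resampling idea can be sketched in reduced form. The sketch below does not reproduce the simulations of this article: it resamples a single fourfold-degenerate amino acid rather than whole genes, and the codon probabilities, sample sizes, number of replicates, seed, and function names are all arbitrary assumptions. It records the mean and standard deviation of $1/\hat{F}$ under Equations A1 and A2 for the same draws.

```python
import random
import statistics


def f_a1(counts):
    """Equation A1 (with replacement)."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)


def f_a2(counts):
    """Equation A2 (without replacement)."""
    n = sum(counts)
    return sum(c * (c - 1) for c in counts) / (n * (n - 1))


def resample(probs, n_codons, replicates=2000, seed=1):
    """Return {"A1": (mean, sd), "A2": (mean, sd)} of 1/F over the replicates."""
    if n_codons <= len(probs):
        # With more draws than codon types, at least one codon occurs twice,
        # so Equation A2 is always positive and 1/F is defined.
        raise ValueError("n_codons must exceed the number of synonymous codons")
    rng = random.Random(seed)
    codons = range(len(probs))
    results = {"A1": [], "A2": []}
    for _ in range(replicates):
        draw = rng.choices(codons, weights=probs, k=n_codons)
        counts = [draw.count(c) for c in codons]
        results["A1"].append(1 / f_a1(counts))
        results["A2"].append(1 / f_a2(counts))
    return {name: (statistics.mean(v), statistics.stdev(v))
            for name, v in results.items()}


# Assumed "true" codon probabilities; the true effective number is 1 / 0.30.
true_probs = [0.4, 0.3, 0.2, 0.1]
for n in (10, 50, 500):
    print(n, resample(true_probs, n))
```

Extending this to whole genes would mean drawing amino acids from an assumed composition first and then feeding the resulting counts to each of the estimators in Table 1.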
Correlation with expression levels: The correlation of different $\hat{N}_c$'s with mRNA levels obtained from E. coli grown in a rich medium was tested with nonparametric correlation analysis as described previously (Goetz and Fuglsang 2005). The raw expression data were from Bernstein et al. (2002).
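A rank-based correlation of the kind reported as $r_s$ can be sketched as follows, assuming that $r_s$ denotes Spearman's rank correlation coefficient (the text only states that the analysis was nonparametric). The function names and the numbers are made up for illustration and are not taken from Bernstein et al. (2002) or from this article.

```python
import math


def average_ranks(values):
    """Rank the values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene values: a codon bias estimate and an mRNA level.
nc_estimates = [55.2, 48.1, 39.7, 30.4, 44.0, 51.3]
mrna_levels = [1.2, 2.5, 6.0, 9.1, 2.1, 1.9]
print(spearman(nc_estimates, mrna_levels))
```

Working on ranks makes the coefficient insensitive to the scale of the expression measurements, which is one reason a rank-based statistic suits microarray-derived mRNA levels.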
RESULTS AND DISCUSSION
The type of homozygosity estimation is a key to both accuracy and precision: Figure 2 shows the results of the simulations using a true Nc of 40.5 (medium bias) and without bias discrepancy. All estimates converge toward the true Nc with increasing length, and this also holds true for other levels of true Nc (data not shown). $\hat{N}_{cW}$ and $\hat{N}_{ce}$ appear to be the methods most independent of length. There is a noteworthy difference in standard deviations: methods using homozygosity based on sampling with replacement ($\hat{N}_{cB}$, $\hat{N}_{cF_4'}$, and $\hat{N}_{cF_5'}$) have lower standard deviations than those based on sampling without replacement ($\hat{N}_{cW}$, $\hat{N}_{cF_4}$, $\hat{N}_{cF_5}$, and $\hat{N}_{ce}$). The pattern is the same at the other levels of bias, i.e., true Nc = 29, 53, and 61 (data not shown). When a regimen with bias discrepancy is used (Figure 3) the situation is slightly different: $\hat{N}_{cW}$ and $\hat{N}_{cB}$ converge toward the wrong value of 36.0 (for an explanation of this phenomenon, see Fuglsang 2004), while the other estimates approach the correct value. Again, methods using homozygosity based on sampling with replacement ($\hat{N}_{cB}$, $\hat{N}_{cF_4'}$, and $\hat{N}_{cF_5'}$) have lower standard deviations than those based on sampling without replacement ($\hat{N}_{cW}$, $\hat{N}_{cF_4}$, $\hat{N}_{cF_5}$, and $\hat{N}_{ce}$). These results illustrate that bias discrepancy is a serious problem for $\hat{N}_{cW}$ and $\hat{N}_{cB}$, and, equally important, that the outcome of a simulation study is highly dependent on the conditions used for the simulation. There are an infinite number of ways to implement a true Nc of 40.5, and Figure 2 shows an implementation that favors all the Nc estimates, while Figure 3 shows an implementation that clearly favors methods that do not average homozygosities within degeneracy classes. The crucial part of this type of experiment is thereby not so much the bias (the true Nc) but rather the way a particular bias level has been implemented.
Are both simulations relevant or realistic? This is fundamentally very difficult to answer because true Nc is never known for genes found in nature. The poor correlation between intraclass $\hat{F}_{aa}$-values that was observed in a previous study (Fuglsang 2004) seems to suggest that bias is seldom uniform within classes. To get closer to an answer to the question, a series of experiments were performed, in which the codon and amino acid frequencies precisely reflect those observed in individual E. coli genes for which all 61 codons are present at least once. In such cases we have a known true Nc for a simulation experiment on the basis of the observed codon and amino acid frequencies of actually observed genes (but the reader should not misinterpret this as having knowledge about true Nc for any actual gene).

Figure 1. Length distribution of genes in E. coli. The histogram was generated with a bin width of 10. The inset is the same histogram with a logarithmic axis.

Figure 2. Average (left) and standard deviation (right) of the estimated Nc, at a "true" Nc of 40.5 (dashed line) without bias discrepancy. Symbols: ■, $\hat{N}_{cW}$; □, $\hat{N}_{cB}$; ●, $\hat{N}_{cF_4}$; ○, $\hat{N}_{cF_4'}$; ◆, $\hat{N}_{cF_5}$; ◇, $\hat{N}_{cF_5'}$; and ×, $\hat{N}_{ce}$. Note that the horizontal axis is logarithmic. The methods based on calculation of $\hat{F}_{aa}$ without replacement (solid symbols) show the highest standard deviations, and all converge toward the true Nc of 40.5 with increasing length.

Figure 3. Average (left) and standard deviation (right) of the estimated Nc, at a "true" Nc of 40.5 (dashed line) with bias discrepancy. Symbols: ■, $\hat{N}_{cW}$; □, $\hat{N}_{cB}$; ●, $\hat{N}_{cF_4}$; ○, $\hat{N}_{cF_4'}$; ◆, $\hat{N}_{cF_5}$; ◇, $\hat{N}_{cF_5'}$; and ×, $\hat{N}_{ce}$. Note that the horizontal axis is logarithmic. The methods based on calculation of $\hat{F}_{aa}$ without replacement (solid symbols) show the highest standard deviations (right), and the methods based on averaging of $\hat{F}_{aa}$ within degeneracy classes ($\hat{N}_{cW}$ and $\hat{N}_{cB}$) converge toward the wrong value of 36 (left).
Figure 4 shows an example of such a simulation with realistic (that is, actually observed) codon and amino acid frequencies. The data for the figure were generated on the basis of the gabP gene of E. coli. Note that $\hat{N}_{ce}$ is superior here in terms of accuracy. Experiments of this type reveal that, generally, estimators that are based on Equation A2 for homozygosity calculation seem to reach a maximum at a length of some 100–300 codons, and that $\hat{N}_{ce}$ is generally most accurate, except for lengths just around the maximum where $\hat{N}_{cW}$ is sometimes (depending on the actual codon frequencies) more accurate. This especially seems to be the case when $\hat{e}$ < 1.5. This clearly emphasizes how important the effect of bias discrepancy, or $\hat{e}$, is for this type of study. A histogram of the estimated errors, $\hat{e}$, for the genome of E. coli K12 is given in Figure 5. The median $\hat{e}$ is 1.7.
My interpretation is that bias discrepancy is a realistic phenomenon that should not be disregarded because it leads to a systematic accuracy error for $\hat{N}_{cW}$ and $\hat{N}_{cB}$ and because it seems to be more of a rule than an exception (Fuglsang 2004). As Figure 4 (and to a lesser extent also Figures 2 and 3) suggests, correcting $\hat{N}_{cW}$ with $\hat{e}$ gives a superior estimate in many cases. Only at very short gene lengths (lengths 50–60 codons) can $\hat{N}_{cF_4}$ or $\hat{N}_{cF_5}$ be superior to $\hat{N}_{ce}$, but at this point a clear recommendation about when this is the case cannot be given. As indicated in Figure 1, gene lengths of 50–60 codons do occur in the genome of E. coli but must be considered exceptional. Generally, I therefore recommend $\hat{N}_{ce}$ as the best overall bias estimator. In reduced genomes, such as that of Mycoplasma genitalium (0.6 Mb) or intracellular parasitic bacteria, genes are often shorter than their counterparts (orthologs) in E. coli and other bacteria with larger genomes. Figures 2–4 suggest that in such cases there could very well be relatively more genes for which $\hat{N}_{cF_4}$ and $\hat{N}_{cF_5}$ are more accurate than $\hat{N}_{ce}$. There is therefore plenty of room for improvement of $\hat{N}_{ce}$. It seems an obvious future objective to characterize better the relationships between length, accuracy, and $\hat{e}$.
Figure 4. Average (left) and standard deviation (right) of the estimated Nc, using the codon and amino acid composition of gabP from E. coli as reference for the simulation ("true" Nc = 47.2). Symbols: ■, $\hat{N}_{cW}$; □, $\hat{N}_{cB}$; ●, $\hat{N}_{cF_4}$; ○, $\hat{N}_{cF_4'}$; ◆, $\hat{N}_{cF_5}$; ◇, $\hat{N}_{cF_5'}$; and ×, $\hat{N}_{ce}$. Note that the horizontal axis is logarithmic. The methods based on calculation of $\hat{F}_{aa}$ without replacement (solid symbols) show the highest standard deviations (right), and the methods based on averaging of $\hat{F}_{aa}$ within degeneracy classes ($\hat{N}_{cW}$ and $\hat{N}_{cB}$) converge toward the wrong value of 45.5 (left). Finally, $\hat{N}_{cW}$ and $\hat{N}_{ce}$ show the best length independence.

Figure 5. Histogram of $\hat{e}$-values from the genome of E. coli K12. Bin width was 0.05.

The least accurate estimates show better correlations with mRNA levels: In Table 2, the correlation coefficients are given for the $\hat{N}_c$'s and mRNA levels in E. coli grown in a rich medium (expression data from Bernstein et al. 2002). Interestingly, Table 2 demonstrates that there is a considerable correlation with the $\hat{N}_c$'s that are based on Equation A1 for codon homozygosity ($\hat{N}_{cF_4'}$, $\hat{N}_{cF_5'}$, and $\hat{N}_{cB}$; rs ≈ 0.39), whereas those that are based on Equation A2 display much lower correlation ($\hat{N}_{cW}$, $\hat{N}_{cF_4}$, $\hat{N}_{cF_5}$, and $\hat{N}_{ce}$; rs ≤ 0.27). For comparison, the correlation coefficient of the codon adaptation index (CAI) with the mRNA levels is 0.43 (Goetz and Fuglsang 2005). CAI is a frequently used indicator of "expressivity" but is species dependent in that the calculation is based on knowledge of codons used in genes that are highly expressed (Sharp and Li 1987). The different $\hat{N}_c$'s are not dependent on such knowledge. Previously it was shown only that $\hat{N}_c^{**}$ had a fair correlation (rs = 0.39) with mRNA levels in E. coli (Goetz and Fuglsang 2005). This bias indicator is also based on Equation A1, but relies on $\hat{F}_{aa}$ being calculated for all 18 degenerate amino acids. Of course, some degree of correlation must be anticipated between CAI and the different $\hat{N}_c$'s. A CAI of 1.0 should always give an $\hat{N}_c$ of 20, but apart from that the relationship is not well characterized for any species. Although the bias measures based on Equation A1 are not the most accurate, their existence is certainly justified by the correlation they display with mRNA levels in E. coli. It needs to be mentioned that there is a known correlation between length and codon bias in E. coli (Eyre-Walker 1996). The correlation coefficient for length and mRNA levels used in Table 2 is rs = 0.2697 (P < 0.0001). If length imposes a constraint on codon usage then that could in principle account for some of the correlation between codon bias and expression. The cause-consequence relationships that exist between length, codon bias, and expression are not well understood, however. Moreover, given that the estimates for codon bias have larger variances for low gene lengths, and that this might also be the case with the estimates of mRNA level, random sampling effects in these areas could very well have an influence on the correlations given in Table 2.

TABLE 2
Correlation of the different bias measures with mRNA levels in E. coli grown in rich medium

| Codon bias estimate | Correlation with mRNA levels (a) in E. coli |
|---|---|
| $\hat{N}_{cW}$ | rs = 0.2198 (b) |
| $\hat{N}_{cB}$ | rs = 0.3947 |
| $\hat{N}_{cF_4'}$ | rs = 0.3939 |
| $\hat{N}_{cF_4}$ | rs = 0.2245 |
| $\hat{N}_{cF_5'}$ | rs = 0.3936 |
| $\hat{N}_{cF_5}$ | rs = 0.2371 |
| $\hat{N}_{ce}$ | rs = 0.2637 |

Interestingly, methods based on Equation A1 for calculation of $\hat{F}$ display far better correlation coefficients than methods based on Equation A2.
(a) Expression data are from Bernstein et al. (2002).
(b) Data are from Goetz and Fuglsang (2005).
Concluding remarks: In conclusion, this study has shown that $\hat{N}_c$'s based on Equation A1 are less accurate than those based on Equation A2, which was Wright's original suggestion. Wright's idea about homozygosity (Equation A2) is therefore definitively better than the method I suggested (Equation A1). $\hat{N}_{ce}$ was generally the best performer in terms of accuracy in simulation studies where the codon and amino acid frequencies were identical to frequencies actually observed in E. coli genes. On the other hand, $\hat{N}_c$'s based on Equation A1 generally have lower standard deviations in simulation studies. They also display a better correlation with mRNA levels in E. coli. The amount of available literature in this field is limited, and there is plenty of opportunity for improvements.
LITERATURE CITED
Banerjee, T., S. K. Gupta and T. C. Ghosh, 2005 Towards a resolution on the inherent methodological weakness of the ‘‘effective
number of codons used by a gene.’’ Biochem. Biophys. Res. Commun. 330: 1015–1018.
Bernstein, J. A., A. B. Khodursky, P. H. Lin, S. Lin-Chao and S. N.
Cohen, 2002 Global analysis of mRNA decay and abundance in
Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proc. Natl. Acad. Sci. USA 99: 9697–9702.
Comeron, J. M., and M. Aguade, 1998 An evaluation of measures of
synonymous codon usage bias. J. Mol. Evol. 47: 268–274.
Eyre-Walker, A., 1996 Synonymous codon usage bias is related to
gene length in Escherichia coli: Selection for translational accuracy? Mol. Biol. Evol. 13: 864–872.
Fuglsang, A., 2004 The ‘effective number of codons’ revisited. Biochem. Biophys. Res. Commun. 317: 957–964.
Fuglsang, A., 2005 On the methodological weakness of ‘the effective number of codons’: a reply to Marashi and Najafabadi. Biochem. Biophys. Res. Commun. 327: 1–3.
Goetz, R. M., and A. Fuglsang, 2005 Correlation of codon bias
measures with mRNA levels: analysis of transcriptome data from
Escherichia coli. Biochem. Biophys. Res. Commun. 327: 4–7.
Gouy, M., and C. Gautier, 1982 Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10: 7055–7074.
Ikemura, T., 1981 Correlation between the abundance of Escherichia
coli transfer RNAs and the occurrence of the respective codons in
its protein genes. J. Mol. Biol. 146: 1–21.
Ikemura, T., 1985 Codon usage and tRNA content in unicellular
and multicellular organisms. Mol. Biol. Evol. 2: 13–34.
Marashi, S.-A., and H. S. Najafabadi, 2004 How reliable re-adjustment
is: correspondence regarding A. Fuglsang, ‘‘the effective number
of codons revisited.’’ Biochem. Biophys. Res. Commun. 324: 1–2.
Sharp, P. M., and W.-H. Li, 1987 The codon adaptation index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15: 1281–1295.
Wright, F., 1990 The ‘effective number of codons’ used in a gene.
Gene 87: 23–29.
Communicating editor: S. Yokoyama
APPENDIX: EXPLANATION OF CODON HOMOZYGOSITY (F )
A basis in population genetics: Consider an infinitely large population, in which we have X different alleles for a given trait and where one zygote carries one allele. The individual frequencies of the alleles/zygotes are $p_1, p_2, p_3, \ldots, p_X$. Let us then calculate the chance F that two randomly picked zygotes carry the same allele:

$$F = p_1 p_1 + p_2 p_2 + \cdots + p_X p_X = \sum_i p_i^2. \qquad (A1)$$

If the two zygotes give rise to offspring, F is the chance that it will be homozygotic for the given allele. Conversely, 1/F
tells the average number of times we can pick two random zygotes before they have identical alleles. That is the
effective number of alleles.
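As a small worked illustration of Equation A1 and the effective number of alleles, the following sketch uses made-up allele frequencies; the function name is an assumption.

```python
def homozygosity(freqs):
    """Equation A1: probability that two random draws carry the same allele."""
    return sum(p * p for p in freqs)


equal = [0.25, 0.25, 0.25, 0.25]   # four alleles, equally frequent
skewed = [0.85, 0.05, 0.05, 0.05]  # four alleles, one dominating

print(1 / homozygosity(equal))   # 4.0: all four alleles are "effective"
print(1 / homozygosity(skewed))  # about 1.37: close to a single effective allele
```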
Sampling with replacement: With the codons in a gene, we use this calculation to get an estimate for codon homozygosity, $\hat{F}_{aa}$, although the direct interpretation then becomes somewhat obscure. A fair way of viewing $\hat{F}_{aa}$ is to understand it as the estimated probability that two codons picked at random from the codon pool (consisting of the codons encoding the amino acid in question) are identical. Equation A1 is obviously applicable to a situation in which there is replacement between the two codons picked: the first codon is put back into the codon pool before sampling the second codon. This is the method proposed in Fuglsang (2005).
Sampling without replacement, Wright's method: If we sample without replacement, we have to take into consideration that on sampling number two there is one less of the type picked in the first sample. Our F-estimate becomes

$$\hat{F} = \frac{n_1}{\sum n_i}\,\frac{n_1 - 1}{\sum n_i - 1} + \frac{n_2}{\sum n_i}\,\frac{n_2 - 1}{\sum n_i - 1} + \cdots + \frac{n_X}{\sum n_i}\,\frac{n_X - 1}{\sum n_i - 1}$$

$$\hat{F} = \frac{\sum n_i^2 - \sum n_i}{\sum n_i \left(\sum n_i - 1\right)} = \frac{\left(\sum n_i\right)^2 \dfrac{\sum n_i^2}{\left(\sum n_i\right)^2} - \sum n_i}{\sum n_i \left(\sum n_i - 1\right)} = \frac{\sum n_i \sum p_i^2 - 1}{\sum n_i - 1}, \qquad (A2)$$

with $p_i = n_i / \sum n_i$. This equation is what Wright used in his original calculation. If there is only one codon for a particular amino acid in the gene (and the amino acid is not Trp or Met), or, more generally, if each synonymous codon is present exactly one or zero times, then there is zero probability of sampling two identical codons, meaning that the effective number of codons cannot be estimated. Finally we note that Equation A2 becomes Equation A1 as the population size (codons in total) goes toward infinity.
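The two homozygosity estimates can be compared directly in a short sketch. The function names and the codon counts are assumptions made for illustration; Equation A2 is evaluated in its first form, the sum of $n_i(n_i - 1)$ divided by $\sum n_i (\sum n_i - 1)$, which is algebraically the same as the final form and keeps the numerator in exact integers.

```python
def f_with_replacement(counts):
    """Equation A1: sum of squared codon proportions (None if no codons)."""
    n = sum(counts)
    if n == 0:
        return None
    return sum((c / n) ** 2 for c in counts)


def f_without_replacement(counts):
    """Equation A2 in its first form: sum n_i(n_i - 1) / (n(n - 1))."""
    n = sum(counts)
    same_pairs = sum(c * (c - 1) for c in counts)
    if n < 2 or same_pairs == 0:
        return None  # every codon seen at most once: 1/F is undefined
    return same_pairs / (n * (n - 1))


counts = [6, 3, 1, 0]  # hypothetical counts for a fourfold-degenerate amino acid
print(f_with_replacement(counts))            # about 0.46
print(f_without_replacement(counts))         # 36 / 90 = 0.4
print(f_without_replacement([1, 1, 0, 0]))   # None: no codon occurs twice

# Multiplying all counts by a large integer moves A2 toward the A1 value.
scaled = [c * 1_000_000 for c in counts]
print(f_without_replacement(scaled))
```

For the same counts Equation A2 never exceeds Equation A1, so the without-replacement estimate of the effective number of codons, $1/\hat{F}$, is never the smaller of the two.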
It cannot be emphasized enough that the choice of method is an empirical matter, first because it makes little sense to think of codons as mating entities and second because $\hat{F}_{aa}$ and $\hat{N}_{c_{aa}}$ are estimates of the codon bias in a codon population that is represented by a gene; the gene is not the codon population itself. The caret symbol (^) is used to indicate that it is an estimate, and consequently we have to discriminate carefully between Wright's concepts of "true" Nc (Wright's original wording) and $\hat{N}_c$. True Nc and $F_{aa}$ are entities we usually have no way of knowing, while $\hat{F}_{aa}$ and thereby $\hat{N}_{c_{aa}}$ and subsequently $\hat{N}_c$ are values we estimate through measurements of the codon population in a gene. In a simulation experiment we can fully control the true F-values, and thus the true Nc, by fixing the codon probabilities, and test how the resulting $\hat{N}_c$ behaves.