Download A novel sensitive method for the detection of user

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Silencer (genetics) wikipedia , lookup

SR protein wikipedia , lookup

Biosynthesis wikipedia , lookup

Gene expression wikipedia , lookup

Expression vector wikipedia , lookup

G protein–coupled receptor wikipedia , lookup

Magnesium transporter wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Ribosomally synthesized and post-translationally modified peptides wikipedia , lookup

Interactome wikipedia , lookup

QPNC-PAGE wikipedia , lookup

Genetic code wikipedia , lookup

Biochemistry wikipedia , lookup

Metalloprotein wikipedia , lookup

Point mutation wikipedia , lookup

Protein purification wikipedia , lookup

Protein wikipedia , lookup

Western blot wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Protein–protein interaction wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Proteolysis wikipedia , lookup

Transcript
BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 9 2006, pages 1055–1063
doi:10.1093/bioinformatics/btl049
Sequence analysis
A novel sensitive method for the detection of user-defined
compositional bias in biological sequences
Igor B. Kuznetsov and Seungwoo Hwang
Gen*NY*sis Center for Excellence in Cancer Genomics, Department of Epidemiology and Biostatistics,
University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY 12144, USA
Received on June 26, 2005; revised on December 23, 2005; accepted on February 7, 2006
Advance Access publication February 24, 2006
Associate Editor: Christos Ouzounis
ABSTRACT
Motivation: Most biological sequences contain compositionally
biased segments in which one or more residue types are significantly
overrepresented. The function and evolution of these segments are
poorly understood. Usually, all types of compositionally biased segments are masked and ignored during sequence analysis. However,
it has been shown for a number of proteins that biased segments
that contain amino acids with similar chemical properties are involved
in a variety of molecular functions and human diseases. A detailed
large-scale analysis of the functional implications and evolutionary
conservation of different compositionally biased segments requires
a sensitive method capable of detecting user-specified types of
compositional bias.
Results: We present BIAS, a novel sensitive method for the detection
of compositionally biased segments composed of a user-specified
set of residue types. BIAS uses the discrete scan statistics that
provides a highly accurate correction for multiple tests to compute
analytical estimates of the significance of each compositionally biased
segment. The method can take into account global compositional
bias when computing analytical estimates of the significance of local
clusters. BIAS is benchmarked against SEG, SAPS and CAST programs. We also use BIAS to show that groups of proteins with the
same biological function are significantly associated with particular
types of compositionally biased segments.
Availability: The software is available at http://lcg.rit.albany.edu/bias/
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
INTRODUCTION
Most protein and DNA sequences contain compositionally biased
segments. In these segments one or more residue types are significantly overrepresented. Whether or not a given sequence segment
has a compositional bias depends on the choice of background
model used to define an unbiased sequence. In most biological
applications residue positions in a protein or DNA sequence are
modeled as a sequence of discrete events generated according to the
random independence model. In this model a residue in sequence
position j, aj, is generated according to its probability, p(aj), and this
probability is independent of the rest of the positions (0-th order
Markov model). Usually, p(aj) is the same as the frequency of
To whom correspondence should be addressed.
residue aj computed using a large collection of protein or DNA
sequences (some sequence database). The random independence
model is used to assess the significance of database hits in the
most popular and efficient sequence similarity search algorithms
such as BLAST (Altschul et al., 1997). Therefore, compositionally
biased segments that violate the assumptions of the random independence model may lead to incorrect estimates of statistical significance in sequence similarity searches and wrong functional and
evolutionary inference.
Identification of compositionally biased segments is important
not only in sequence similarity searches. DNA sequences contain
a large fraction of diverse biased segments with poorly understood
function. More than 50% of all proteins have been shown to contain
biased segments (Wootton and Federhen, 1996). Some researchers
believe that they are similar to junk DNA and are mostly determined
by the compositional bias on the level of coding DNA sequence
(Nishizawa and Nishizawa, 1998; Huntley and Golding, 2000;
Singer and Hickey, 2000). Others suggest that biased segments
in proteins are not just junk sequences and are involved in a variety
of molecular functions: protein active sites, protein–DNA and
protein–protein interactions (Karlin et al., 2003), transcription
regulation (Brendel and Karlin, 1989), membrane transport
(Kreil and Ouzounis, 2003), structural function (Karlin et al.,
2003), peptides at protein termini (Berezovsky et al., 1999). In
the case of a new protein that does not have detectable homologs,
weak but statistically significant compositionally biased segments
with certain properties can be one of few clues that the researcher
can use to make a biologically meaningful guess about protein’s
function and structure. Compositionally biased protein segments
have also been implicated in a number of human diseases such
as prion diseases, Huntington’s disease, cancer and others
(Harrison and Gerstein, 2003; Karlin et al., 2002). It has been
proposed that certain types of these segments are overrepresented
in proteins involved in neurological disorders (Karlin and Burge,
1996). Unlike proteins, DNA sequences are composed only of
four residue types and can be quite long. As a result, compositional bias in DNA is considerably harder to deal with than that
in proteins.
There are two types of compositional bias: global and local. In the
case of global bias, the entire protein sequence contains a large
excess of particular residue types (an excess of hydrophilic residues,
for instance) and becomes one long compositionally biased segment. In the case of local bias, most of the protein sequence
conforms to the random independence model with only relatively
The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
1055
I.B.Kuznetsov and S.Hwang
small local clusters of over- or underrepresented residue types. Such
locally biased segments may be the most likely candidates for
functionally and/or structurally important sites. A number of
methods that identify and mask compositionally biased segments
by replacing them with lowercase letters or a string of ‘X’s in
protein and ‘N’s in DNA sequences have been developed. The
popular SEG algorithm uses a sliding window and entropy-based
measure to detect regions with low-information complexity called
low-complexity regions (Wootton and Federhen, 1996) that correspond to compositionally biased segments. Masking low-complexity
regions with SEG significantly improves the specificity of sequence
similarity search and it has become a de facto standard in protein
BLAST searches. The SEG algorithm uses equal prior probabilities
of residue types and segmentation threshold, chosen based on
random sequences, and therefore provides no estimates of statistical
significance of the detected low-complexity regions. It has been
proposed that SEG reports too many low-complexity segments
and introduces an artificial bias into the distribution of their length
owing to a pre-set size of the window used to identify seed segments
(Kreil and Ouzounis, 2003). CAST is another method for detecting
compositionally biased segments (Promponas et al., 2000). It uses
pairwise sequence alignment of query sequence with homopolymer
sequences (20 homopolymer sequences in the case of proteins) and
assigns significance based on the local alignment statistics. SIMPLE
is a method for the identification of simple repeats in protein and
DNA sequences that estimates statistical significance by using
randomly generated sequences of the same length and composition
as those of the test sequence (Alba et al., 2002). Neither of these
methods is suitable for the identification of short biased segments
composed of a user-specified subset of residue types with similar
chemical properties.
The first statistical method designed for the identification of
compositional bias by searching for statistically significant clusters
of amino acid residues with similar chemical properties was proposed by Karlin et al. (1990) and implemented in program SAPS
(Brendel et al., 1992). In this method, an amino acid sequence of
length N is represented as a sequence of successes and failures in
N independent Bernoulli trials. Successes correspond to residues
that belong to a pre-selected group of similar amino acids (charged
residues, for instance) and failures correspond to the rest of the
20 amino acids. Unusual clusters of amino acids from the group
of interest are identified by counting the number of successes in a
sliding window of fixed size. The significance of each window is
estimated using the normal approximation to the binomial distribution. This approach has two major limitations. First, in order for
the normal approximation to be valid window size must be sufficiently large. Second, this simple model does not provide a correction for multiple tests, a problem that arises from the use of all
possible windows. In order to deal with multiple tests, SAPS
uses three conservative p-value thresholds for proteins shorter
than 750 residues, proteins with length 750–1500 residues, and
proteins longer than 1500 residues. Because of these limitations,
SAPS may miss weak but still significant and biologically informative signals in a form of small or low-density clusters. The program
was designed neither for masking compositionally biased segments
nor for the analysis of DNA sequences. Recently, a statistical
method for finding compositionally biased regions in proteins
similar to SAPS has been proposed (Harrison and Gerstein,
2003). This method also counts the number of successes in a
1056
sliding window and uses the exact binomial test to estimate statistical significance of each window. As in the case of SAPS, there is
no explicit analytical correction for multiple testing. None of
the above methods separates global compositional bias from the
local one.
Here we present BIAS, a novel method and software for the
identification of compositionally biased segments in protein and
DNA sequences. BIAS automatically searches for unusual clusters
of user-specified residue types and computes analytical estimates of
the significance of each cluster. These estimates are based on scan
statistics that allows one to detect even subtle local deviations from
the random independence model. BIAS also distinguishes whether
or not observed clusters are significant due to local or global
compositional bias. Although the method is primarily intended
for the analysis of proteins, it can be applied to DNA sequences
as well. We use a number of biological examples to compare the
performance of BIAS with that of SEG, SAPS and CAST programs.
We also show that groups of proteins with the same biological
function contain compositionally biased segments of the same
type and that this association between type of function and type
of compositional bias is highly non-random.
SYSTEMS AND METHODS
Overview
BIAS addresses the following problem. Given a sequence, S, of length N
composed of residues generated using an alphabet, A, of size n according to
the random independence model and a subalphabet, B, of m residues selected
from A (m < n), find subsequences of S in which residues from B are
overrepresented. These compositionally biased subsequences (segments)
will correspond to clusters that have an unusually high density of residues
from B and are unlikely to be observed by pure chance under the random
independence model. In order to identify such clusters, S is represented as a
sequence of successes and failures in N independent Bernoulli trials. Successes correspond to residues from B, failures correspond to those residues
from A that are not included in B. Successes separated by less than d positions are merged into a single cluster. Statistical significance of each cluster
is estimated using discrete scan statistic. Discrete scan statistics, denoted Sw,
is the maximum number of successes observed within any w contiguous
trials in a sequence of N Bernoulli trials. The tests based on scan statistics use
a significance level that takes into account the sliding of the window to
maximize the density of successes within the cluster. The unconditional
discrete scan statistic is used in which the total number of successes in
N trials is treated as a random variable. The probability of success is
computed as the sum of probabilities of observing residue types from the
subalphabet B.
Formal description of BIAS
Let S ¼ s1, . . . , sN be a biological sequence, where each residue si belongs
to an alphabet A ¼ {a1, . . . , an}. Let B ¼ {b1, . . . , bm} be a subalphabet
of A (m < n). Let P( j) be the probability of occurrence of residue type j.
Recode S into a sequence of N independent identically distributed binomial
random variables x1, . . . , xN, where P(xi ¼ 1) ¼ 1 P(xi ¼ 0) ¼ p, according
to the following rule:
(
1ðsuccessÞ if si 2 B
xi ¼
p ¼ PðXi ¼ 1Þ ¼
0ðfailureÞ otherwise
m
X
Pðbj Þ:
ð1Þ
j¼1
Use the linkage distance, d, to merge all positions between and including
two successes xi and xj into a cluster, C(i, j), if the following two conditions
Detection of user-defined compositional bias
are satisfied:
Input amino acid sequence of length 20 (N=20)
ACTLVDDVEDSTLLMATHVK
(1) C(i, j) cannot be extended—subsequences xi d, . . . , xi 1 and
xj + 1, . . . , xj + d (i d 1 and j + d N) consist only of
failures or i ¼ 1 or j ¼ N (positions at sequence termini).
(2) Any two consecutive successes xf and xg from the cluster are
separated by less than d positions: for i f < g j, g f d.
For a given cluster C(i, j) compute the number of successes, k, and cluster
size, w ¼ j i + 1. Estimate the significance of the cluster by computing
probability of observing the value of unconditional discrete scan statistic,
Sw, greater or equal to k in a sequence of N trials with the probability of
success p, P(Sw k j N, p). This probability can be computed analytically
using the following highly accurate approximation (Glaz et al., 2001):
ðN/wvÞ
Q
‚
ð2Þ
PðSW k j N‚ pÞ ¼ 1 Q1 · 2
Q1
where v ¼ 2 if N is a multiple of w and v ¼ 1 otherwise. For 2 < k < N,
we have
Q1 ¼ ðFb ðk1; w‚ pÞÞ2 ðk 1Þ · bðk; w‚ pÞ · Fb ðk 2; w‚ pÞ
þ w · p · bðk; w‚ pÞ · Fb ðk 3; w 1‚pÞ
ð3Þ
Recode the sequence by assigning success,
1, to negatively charged residues (D or E) and
failure, 0, to the remaining 18 amino acid
Sequence of 20 trials with 4 successes and 16
failures. Probability of success: p0 = P(D) + P(E)
00000110110000000000
Merge successes separated by less than
d positions into a cluster (d=2 here)
Cluster of 4 successes in a
window of size 5 (S5 = 4)
00000110110000000000
Q3 ¼ ðFb ðk1; w‚pÞÞ3 A1 þ A2 þ A3 A4
bðk; w‚ pÞ ¼ Cwk · pk · ð1pÞwk
Cwk ¼
w!
k!ðw kÞ!
ð4Þ
ð5Þ
8
k
>
< X
bði; w‚pÞ if k ¼ 0‚1‚. . .‚ w
Fb ðk; w‚pÞ ¼
>
: i¼0
0 if k < 0
ð6Þ
A1 ¼ 2 · bðk; w‚pÞ · Fb ðk 1; w‚ pÞ · fðk 1Þ · Fb ðk 2; w‚ pÞ
w · p · Fb ðk 3; w 1‚pÞg
ð7Þ
Estimate significance of the observed
cluster of 4 successes in a window of size 5,
P(S5 ≥ 4|N=20,p0) using Eq.2
Report sequence segments that correspond to
clusters of successes with p-value below threshold
(0.05 by default). Here, a segment DDVED with
p-value=10-2 is reported.
A2 ¼ 0:5 · ðbðk; w‚ pÞÞ2 · fðk 1Þ · ðk 2Þ · Fb ðk 3; w‚ pÞ
2 · ðk 2Þ · w · p · Fb ðk 4; w 1‚pÞ
ð8Þ
Output masked sequence
ACTLVddvedSTLLMATHVK
þ w · ðw 1Þ · p2 · Fb ðk 5; w 2‚pÞg
k1
X
bð2k r; w‚ pÞ · ðFb ðr1; w‚ pÞÞ2
A3 ¼
Fig. 1. Flowchart that shows an application of BIAS to search for clusters of
negatively charged residues in a protein sequence.
ð9Þ
r¼1
A4 ¼
k1
X
bð2k r; w‚ pÞ · bðr; w‚ pÞ · fðr 1Þ · Fb ðr 2; w‚ pÞ
r¼2
ð10Þ
w · p · Fb ðr 3; w 1‚pÞg:
The significance of global compositional bias of sequence S is estimated
using the binomial distribution. Let h be the total number of occurrences
of residues from B in S. Then, for an excess of residues from B (if h N · p)
we estimate P(X h j N, p). For a lack of residues from B (if h < N · p)
we estimate P(X < h j N, p) by the following equations:
PðX h j N‚ pÞ ¼
N
X
bði; N‚pÞ PðX < h j N‚ pÞ ¼
i¼h
h1
X
bði; N‚ pÞ:
ð11Þ
i¼0
A flowchart that illustrates the application of the method to find clusters of
negatively charged residues in amino acid sequence is shown in Figure 1.
IMPLEMENTATION
The method is implemented as a standard ANSI C++ program.
The program (PBIAS for proteins and NBIAS for DNA) has a
command-line interface, reads input sequences in standard
FASTA format and is suitable for high-throughput sequence analysis pipelines. Source code and pre-compiled binaries for Windows
and Linux are freely available for non-commercial use at http://lcg.
rit.albany.edu/bias/. A Perl script, MPBIAS.PL, that implements a
CAST-like (Promponas et al., 2000) procedure for masking of
low-complexity segments in a protein sequence is also provided.
This script searches for subsequences in which one of the 20 amino
acid types is significantly overrepresented and then replaces all
positions in such subsequences with the ‘X’ character.
Probability of each of the 20 amino acid types is estimated using
its frequency in a non-redundant protein database. Two sets of
frequencies are used: SPROT50, derived from the SWISS-PROT
database (Bairoch and Apweiler, 2000) and PDB50, derived from
the Protein Databank proteins (Berman et al., 2000). Sequences in
both datasets were clustered at 50% sequence identity, meaning that
all sequence pairs within the same dataset have identity <50%.
Clustering was performed using the CD-HIT program (Li et al.,
2002). PDB50 frequencies serve as the background model derived
1057
I.B.Kuznetsov and S.Hwang
Table 1. Comparison of compositionally biased regions detected by PBIAS, SEG, SAPS and CAST in human prion protein, PDB id: 1QLX, Swiss-Prot id:
P04156
Functionally/structurally annotated region
PBIAS regions
SEG regions
SAPS regions
CAST regions
N-terminal signal peptide, residues 1–22
1–22a, p ¼ 2 · 102
a:LVFMWCI
240–252b, p ¼ 2 · 104
(232–252 for e:5)
a:LVFMWCI
26–108b, p ¼ 2 · 107
a:GPSTND
188–193b, p ¼ 6 · 104
N/D
N/D
N/D
237–252
240–253 (hydrophobic)
232–253 (transmembrane)
N/D
50–94
54–94 (identified as
simple tandem repeat)
N/D
29–94 (G-rich)
C-terminal signal peptide, residues 231–253
Flexible unstructured N-terminal
domain, residues 23–124
Fragment of helix B that unwinds upon prion
dimerization, residues 189–195
188–201
N/D
a:T
N/D, not detected.
a
PBIAS detects this region only if sequence-derived amino acid frequencies are used.
b
Segment remains significant if global compositional bias is accounted for.
from mostly globular proteins, whereas SPROT50 frequencies
serve as the background model derived from all proteins in the
protein universe. In order to remove the effect of global compositional bias, significance of each cluster is also estimated using
sequence-derived residue frequencies (default option for DNA
sequences). The program also has options for estimating the
significance of local compositional bias by using random sequences.
DISCUSSION
In this section we present the application of PBIAS to a number of
well-annotated proteins that contain domains with a different type
of chemical bias that determines specific properties of these
domains and to a set of compositionally biased proteins from the
Plasmodium genome. The performance of PBIAS is compared with
that of SEG (Wootton and Federhen, 1996), SAPS (Brendel et al.,
1992) and CAST (Promponas et al., 2000). For a reasonably fair
comparison, all programs were used with default arguments (linkage distance of 4, SPROT50 frequencies and p-value threshold of
0.05 for PBIAS). These arguments were shown to perform well on a
wide variety of compositionally biased proteins. Depending on the
nature of each protein site, PBIAS was used with a subalphabet
that contains amino acid types with similar chemical properties
suitable for the identification of this particular site: hydrophobic,
aliphatic, charged, etc. SAPS was used via web interface available
at http://www.ch.embnet.org/software/SAPS_form.html. CAST was
used via web interface available at http://biophysics.biol.uoa.gr/
cgi-bin/CAST/cast_cgi. We also show that NBIAS can be successfully used to mask a cDNA sequence coding for a compositionally
biased protein domain. Finally, we use BIAS to demonstrate a highly
significant non-random association between type of function and
type of compositionally biased segments observed in groups of
proteins with the same biological function.
Prion protein
The prion protein, PrP, is a rare infectious agent that causes transmissible spongiform encephalopathy. PrP can undergo a dramatic
transition from the normal mostly helical conformation to the pathogenic conformation rich in beta-sheet (Prusiner, 1998). Prion
1058
protein has been extensively characterized experimentally. It consists of a flexible unstructured N-terminal domain thought to be
involved in binding copper and a structured mostly helical
C-terminal domain. It has been shown that upon formation of
PrP dimer the switch region located in the C-terminus of helix B
unwinds (Knaus et al., 2001). PrP also contains N- and C-terminal
signal peptides which are removed during post-translational
modification (Lehmann et al., 1999). The results of application
of PBIAS, SEG, SAPS and CAST to human prion protein are
shown in Table 1 and briefly summarized below.
(1) PBIAS is the only program that accurately identifies the
N-terminal signal peptide as an unusually dense cluster of
hydrophobic residues. PBIAS, SEG and SAPS identify the
C-terminal signal peptide. In this case, SAPS is most accurate
with default parameters (residues 232–253), but it reports
the peptide as a transmembrane region. SAPS also reports
the same peptide as a hydrophobic region, 240–253 versus
240–252 in PBIAS. PBIAS accurately identifies signal peptide
residues 232–252 when linkage distance is increased from
the default value of 4 to 5.
(2) PBIAS accurately identifies most of the unstructured
N-terminal domain as a highly significant long cluster of
residues with high propensity for irregular (coil) conformation.
CAST also identifies a significant portion of this domain.
Both SEG and SAPS identify only about half of this domain.
Unlike PBIAS, SEG and SAPS do not provide any insights
into the properties of the residues in this domain.
(3) PBIAS and SEG identify the flexible C-terminal end of helix B.
PBIAS identifies it as a highly significant cluster of Threonine
residues and is more accurate than SEG. Neither SAPS nor
CAST report this segment.
It should be noted that all segments identified in PrP by PBIAS
are statistically significant when the global compositional bias of
the protein is accounted for. In other words, they all represent true
cases of local compositional bias. This is especially interesting
in the case of the unstructured N-terminal domain. Although PrP
has a significant excess of residues with high propensity for coil
Detection of user-defined compositional bias
parameters. PBIAS with default arguments and subalphabet of
aliphatic residues (lipid-soluble residues) identifies a slightly longer
segment that includes this domain. If linkage distance is changed
from the default value of 4 to 3, PBIAS identifies the same domain
as SAPS (Table 2). SEG identifies only a part of the transmembrane
domain. Both PBIAS and SEG find another compositionally biased
segment in the C-terminus. CAST does not report any segments.
Ribosomal protein L32E
Ribosomal protein L32E has two disordered regions, long (residues
1–94) and short (residues 237–340) (Klein et al., 2001). PBIAS is
most accurate at identifying a long region as an unusual cluster of
residues with high propensity for disordered conformation. Both
SEG and SAPS identify only a small fraction of this region, SAPS
identifying it as a negatively charged region (Table 2). CAST
reports the 2–113 segment as Glu-rich.
C-Myc I protein
Fig. 2. The three-dimensional structure of Ser-tRNA synthetase dimer
(PDBid: 1SRY). Extended non-globular domains involved in interactions
with cognate tRNA are shown in black.
conformation [Equation (11), p ¼ 3 · 103), the cluster of these
residues corresponding to the N-terminal domain is significant
(p ¼ 5 · 104) even when sequence-derived frequencies are used.
Ser-tRNA synthetase
The N-terminal domain of Ser-tRNA synthetase is a non-globular
long helical bundle that interacts with cognate tRNA (Fig. 2)
(Fujinaga et al., 1993). PBIAS accurately identifies this domain
as a highly significant (p ¼ 4 · 107) cluster of strong helix
formers (Table 2). Despite a large excess of strong helix formers
[Equation (11), p ¼ 7 · 106], this cluster remains significant
(p ¼ 8 · 103) when global bias is accounted for. SEG identifies
two short low-complexity segments that cover only 35% of
this domain. CAST identifies about two-third of this domain.
SAPS does not report any unusual segments.
Collagen
Collagen is an extracellular structural protein with three long nonglobular triple-helical domains rich in amino acids Glycine and
Proline that have unusual backbone conformational preferences.
These triple-helical domains are involved in the formation of
coiled–coil structure (Beck and Brodsky, 1998). PBIAS is most
accurate at identifying these domains as highly significant clusters
of Glycine and Proline (Table 2). These clusters remain significant
when global bias is accounted for. SEG identifies many short lowcomplexity regions in these domains. SAPS identifies the entire
275–897 segment as a highly repetitive region. CAST identifies
269–899 segment as Gly- and Pro-rich.
C-Myc I protein has an acidic domain (residues 226–245) and a
leucine zipper domain (residues 387–415) (Vriz et al., 1989).
PBIAS is most accurate at identifying these domains as clusters
of charged residues (Table 2), whereas SEG reports too many lowcomplexity regions. Both CAST and SAPS identify the acidic
domain but miss the leucine zipper.
Masking cDNA sequence of prion protein
NBIAS was used with default arguments (linkage distance 5,
sequence-derived frequencies, p-value threshold of 0.05) and
subalphabet GC to mask the cDNA of human prion protein. This
procedure resulted in masking almost entire N-terminal part (codons
2–105) which is significantly compositionally biased in the protein
sequence. SEG only masked codons 96–105. Application of NBIAS
to the cDNA sequence of chicken PrP that has very different nucleotide composition and shares only 35% sequence identity with
human PrP gives similar results (see Supplementary information).
Compositionally biased segments in PDB and
SWISS-PROT proteins
It has been shown before using SEG that low-complexity segments
are underrepresented in globular proteins available in the Protein
Databank (PDB) (Huntley and Golding, 2002). Application of
PBIAS to proteins from the PDB and SWISS-PROT confirms
this observation. As one can see from Figure 3, the percentage
of proteins with at least one compositionally biased segment that
has p-value < 103 is significantly lower in PDB proteins for all
tested subalphabets. The largest difference is observed for hydrophobic and hydrophilic segments that are underrepresented in PDB
by more than one order of magnitude. This is consistent with the
observation that sequences lacking periodicity of hydrophobic/
hydrophilic residues cannot form compact globular structure
(Dill, 1999; Silverman, 2005) and therefore are underrepresented
in mostly globular PDB proteins.
Cytochrome C biogenesis protein
Analysis of low-complexity segments in
Plasmodium falciparum proteins
Cytochrome C biogenesis protein is involved in the heme delivery
pathway for cytochrome c maturation (Stevens et al., 2004).
This protein contains a transmembrane domain (residues 67–88).
SAPS is most accurate at identifying this domain with default
In order to compare the performance of PBIAS with that of SEG
and CAST on a large sequence set we used 210 proteins from
chromosome 2 of P.falciparum (Gardner et al., 1998). We chose
this dataset because it represents a set of highly compositionally
1059
I.B.Kuznetsov and S.Hwang
Table 2. Comparison of compositionally biased regions detected by PBIAS, SEG, SAPS and CAST
Protein and annotated regions
PBIAS regions
SEG regions
SAPS regions
Seryl-tRNA synthetasea
N-terminal domain: 1–105
Collagen alpha 1 precursor,c
triple-helical domains:
269–405, 418–756, 787–901
1–100b, p ¼ 4 · 107
a:ARQEKM
a:GP
269–757b, p ¼ 3 · 1014
784–903b, p ¼ 3 · 1014
24–34, 74–99
N/D
275–897 (identified
as highly repetitive
region)
269–899 (G-rich)
270–398 (P-rich)
792–898 (P-rich)
Cytochrome Cd biogenesis protein,
transmembrane domain: 67–88
a:AVIL
67–94b, p¼ 2 · 103
(67–88 for e:3)
161–180e, p¼ 102
a:AGPSTQDEK
1–119e, p¼ 2 · 104
269–326, 349–400, 549–565,
601–622, 630–661, 676–694,
723–742, 787–811,
837–855, 878–899
67–84, 169–180
67–88 (identified as
transmembrane)
N/D
Ribosomal proteinf L32E
disordered regions:
1–94, 237–240
C-Myc I proteing
Acidic domain: 226–245
Leucine zipper: 387–415
4–15, 71–94, 211–224
a:KRDE
226–245b, p¼ 3 · 105
389–411e, p¼ 2 · 102
CAST regions
34–95 (E-rich)
71–96 (identified
as negatively
charged segment)
226–245 (identified as
charged segment)
52–73, 136–162,
219–243, 399–416
2–113 (E-rich)
59–341 (S-rich)
225–260 (E-rich)
N/D, not detected.
a
PDB id: 1SRY, Swiss-Prot id: P34945.
b
Segment remains significant if global compositional bias is accounted for.
c
SwissProt id: P20849.
d
SwissProt id: P52224.
e
Segment is not significant if global compositional bias is accounted for.
f
PDB id: 1JJ2X, Swiss-Prot id: P12736.
g
SwissProt id: P06171.
7%
50
6%
Percentage of sequences
SPROT50
PDB50
5%
4%
3%
2%
40
30
BIAS
SEG
CAST
20
10
1%
0
0%
GA
GP
LIVMF
GPDNST
RKNEDSTQ
DEKR
0
1
2
3
4
5+
Number of low-complexity regions per sequence
Fig. 3. The fraction of PDB and SWISS-PROT sequences that contain at
least one compositionally biased segment with p-value 103 composed
of residue types from the following subalphabets: GA, smallest highly
flexible amino acids; GP, amino acids with unusual backbone preferences;
LIVMF, major hydrophobic amino acids; RKNEDSTQ, major hydrophilic
amino acids; DEKR, charged amino acids; GPDNST, amino acids with high
preference for coil conformation.
Fig. 4. Percentage of low-complexity segments detected by PBIAS, SEG and
CAST in 210 proteins from chromosome 2 of P.falciparum. The frequency of
occurrence of sequences with 0, 1, 2, 3, 4 or 5 low-complexity segments
is shown. For PBIAS analysis was performed using MPBIAS.PL script with
default arguments: linkage distance is estimated from the frequencies of
individual amino acids (option k), significance threshold is corrected for
20 multiple comparisons and set to 0.0025 (option m: 0.0025).
biased eukaryotic proteins that was extensively used before to
compare CAST and SEG methods (Promponas et al., 2000).
Each of the three programs was used with the default arguments.
The distribution of the number of low-complexity segments
detected in this dataset by PBIAS, SEG and CAST is shown in
Figure 4. We find that performance of PBIAS is more similar to that
of SEG than CAST. For instance, PBIAS identifies at least one
low-complexity segment in 93% of the sequences, SEG in 90%,
whereas CAST only in 74%. Both PBIAS and SEG also report
significantly higher number of sequences that contain five or
more low-complexity segments than CAST.
We also studied how PBIAS masking of low-complexity
segments performs in database searching compared with masking
with SEG and CAST. We used each of the 210 P.falciparum
sequences masked by PBIAS, SEG and CAST as a query sequence
in BLASTP (Altschul et al., 1997) sequence similarity search.
BLASTP was run against the NCBI NR database with the
default arguments and masking option turned off. For each query
1060
Detection of user-defined compositional bias
Table 3. The average and standard deviation of the overlap between hit lists returned for comparisons PBIAS versus SEG and PBIAS versus CAST performed
using the database searching benchmark
BLASTP E-value ¼ 106
BLASTP E-value ¼ 1030
OSP
P (Eq. 12)
OSP
S (Eq. 12)
OCP
P (Eq. 13)
OCP
C (Eq. 13)
95.2% ± 13.0%
96.8% ± 12.2%
74.1% ± 36.0%
72.0% ± 37.5%
98.0% ± 10.0%
99.7% ± 2.3%
76.0% ± 33.1%
67.6% ± 37.9%
See text for details.
sequence, i, we computed four overlap indices normalized between
0 and 100%:
OSP
S ði‚EÞ ¼
OCP
C ði‚EÞ ¼
N SEG\PBIAS ði‚ EÞ
N SEG ði‚EÞ
N CAST\PBIAS ði‚EÞ
N CAST ði‚ EÞ
OSP
P ði‚ EÞ ¼
OCP
P ði‚EÞ ¼
N SEG\PBIAS ði‚EÞ
N PBIAS ði‚EÞ
ð12Þ
N CAST\PBIAS ði‚EÞ
‚
N PBIAS ði‚ EÞ
ð13Þ
where NX(i, E) is the total number of hits with BLAST E-value
below threshold E returned for query sequence i masked with
method X. NX\Y(i, E) is the number of hits with E-value below
threshold E that appear in both hit list for i masked with method X
and in hit list for i masked with method Y (size of the overlap
between the two lists). X,Y2{PBIAS, CAST, SEG}. If the value
of a given overlap index, OXY
X , is 100% it means that all hits in list X
are also included in list Y. Small values of the overlap index
mean that list X contains many hits that are not included in list
Y. We used two E-value thresholds, 106 and 1030.
The results of BLASTP benchmarking indicate that when
PBIAS is compared with SEG, identical hit lists are returned for
30% of the sequences if E-value is set to 106 and for 41.4% of
the sequences if E-value is set to 1030. When PBIAS is compared
with CAST, identical hit lists are returned for 31.4 and 39.5% of the
sequences, respectively. A more detailed analysis of the overlap
indices (Table 3) shows that both SEG- and CAST-masked query
sequences tend to return more hits than the PBIAS-masked ones
(as indicated by low average OC and OS overlap indices ranging
from 67.6 to 76.0%). Almost all hits returned for PBIAS-masked
sequences are included in SEG and CAST hit lists (as indicated by
high average OP overlap indices ranging from 95.2 to 99.7%).
Compositionally biased segments and protein function
An important question regarding compositionally biased segments
is whether they exhibit statistically significant associations with
protein function. We used functional annotation from the Gene
Ontology (GO) database (Ashburner et al., 2000) to study if
clusters of residues with particular chemical properties preferentially occur in proteins with specific types of function. The GO
database consists of three major categories (ontologies): cellular
component, biological process and molecular function. Each
ontology consists of a set of unique terms indexed by unique id
numbers. A protein is classified using zero or more terms from
each of the three ontologies. For instance, SWISS-PROT protein
P80385 is linked to GO term 4679 (Molecular function: AMPactivated protein kinase activity) and GO term 6950 (Biological
process: response to stress). We used the following procedure to
identify a significant association between biased segments from a
subalphabet C and a particular GO term i:
(1) Use PBIAS with default arguments and a given subalphabet,
C, to search each protein from SPROT50 dataset for biased
segments. For GO term i count the number of proteins, n(i),
that contain at least one biased segment with p-value below the
threshold of 103 and are linked to this term.
(2) Count the total number of proteins in SPROT50, M, that
contain at least one biased segment from subalphabet C.
Randomly sample, without replacement, M proteins from
SPROT50. Count the number of proteins in this random
sample associated with GO term i, r(i). Repeat the sampling
N times.
(3) Use random samples to compute for each r(i) its average, E(i),
and standard deviation, S(i). Use these to compute the Z-score
for GO term i, z(i):
nðiÞ EðiÞ
zðiÞ ¼
,
SðiÞ
where
PN
EðiÞ ¼
and
j¼1
rð jÞ
N
sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
PN
2
j¼1 ðrð jÞ EðiÞÞ
:
SðiÞ ¼
N1
ð14Þ
Random sampling was performed 105 times (N ¼ 105,
a number for which the Z-score converges up to one decimal
digit).
We used two non-overlapping subalphabets, RKENDSTQ
(hydrophilic amino acids) and LVIFM (major hydrophobic
amino acids), and a subalphabet that partially overlaps with the
hydrophilic one, GPTSDN (amino acids with high propensity for
coil conformation). Figure 5 shows all GO terms from ‘Molecular
function’ ontology that have the absolute value of z(i) greater than
6.0 for at least one of the three subalphabets. Since we cannot
assume that all distributions are normal, the threshold Z-score of
6.0 was chosen to ensure that the observed associations are significant even for skewed distributions. Inspection of Figure 5 leads
to the following conclusions:
(1) There are groups of GO terms associated with only one type
of biased segments (hydrophobic, hydrophilic or coil),
despite the fact that hydrophilic and coil subalphabets
overlap. Hydrophobic clusters are associated with receptor
activity, voltage-gated channel activity, transporter activity.
Hydrophilic clusters are associated with nuclease activity,
1061
I.B.Kuznetsov and S.Hwang
Fig. 5. GO functional categories from ‘Molecular function’ ontology that are significantly associated with clusters of residues with particular chemical
properties. The following subalphabets were used: RKNEDSTQ, major hydrophilic amino acids; LIVMF, major hydrophobic amino acids; GPDNST, amino
acids with high preference for coil conformation. The Z-score given by the height of the bar quantifies the strength of the association between a particular GO
functional category given by GO id (x-axis) and a particular type of residue clusters. The type of residue cluster is given by the shading pattern (white, black or
striped bars). A complete description of each GO term is given in Supplementary information.
transcription, splicing and translation. Clusters of coil residues
are associated with direct binding to nucleic acids, protein
binding, structural constituent of cuticle.
(2) Presence of hydrophobic clusters is negatively correlated
with the presence of hydrophilic clusters. There are, however,
three exceptions: GO:4872 (receptor activity), GO:5245
(voltage-gated calcium channel activity) and GO:5486
(t-SNARE activity), all being integral membrane proteins.
Very similar trends are observed in the other two ontologies,
‘Cellular component’ and ‘Biological process’. However, the number of significant associations in these two ontologies is very large
and cannot be presented in a compact form suitable for the limited
journal space. Detailed results for a variety of subalphabets and
all three ontologies will be presented elsewhere. The primary
point from Figure 5 is that we do observe that groups of proteins
with the same biological function contain the same type of compositionally biased segments and this association between type of
function and type of compositional bias is highly non-random. This
suggests that compositionally biased segments are not just ‘junk
sequences’ and can be used to construct potential function-specific
de novo protein signatures.
CONCLUSIONS
We presented BIAS, a method for the identification of compositionally biased segments composed of a user-specified set of
residue types. Main features of the method are as follows.
1062
(1) BIAS finds statistically significant clusters of user-specified
residue types that are unlikely to be observed under the
random independence model. Significance is estimated analytically using scan statistics that provides a highly accurate
correction for multiple tests.
(2) BIAS provides an exact estimate of global compositional
bias of the test sequence and can take into account global
compositional bias when computing analytical estimates of
the significance of local clusters.
(3) Both the main advantage and main shortcoming of the
proposed method is that it requires the user to supply a
subalphabet of residue types used to search for compositional bias. The advantage is that BIAS can be used to
search for compositionally biased segments composed of
amino acids with similar chemical properties such as
charge, hydrophobicity, size, etc. The disadvantage is that
BIAS will ignore biased regions composed of residue
types that are not included into the user-supplied subalphabet.
The application of BIAS to a number of test proteins has
shown that it is suitable for de novo characterization of functionally important sites. The application of BIAS to study proteins
from the SWISS-PROT database has also shown that groups of
proteins with the same biological function contain the same type
of compositionally biased segments and that this association
between type of function and type of compositional bias is
non-random.
Detection of user-defined compositional bias
ACKNOWLEDGEMENTS
The authors thank Dr Yaroslava Ruzankina for help and two
anonymous reviewers for their valuable comments.
Conflict of Interest: none declared.
REFERENCES
Alba,M.M. et al. (2002) Detecting cryptically simple protein sequences using the
SIMPLE algorithm. Bioinformatics, 8, 672–678.
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Ashburner,M. et al. (2000) Gene ontology: tool for the unification of biology. The
Gene Ontology consortium. Nat. Genet., 25, 25–29.
Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein sequence database and
its supplement TrEMBL in 2000. Nucleic Acids Res., 28, 45–48.
Beck,K. and Brodsky,B. (1998) Supercoiled protein motifs: the collagen triple-helix
and the alpha-helical coiled coil. J. Struct. Biol., 122, 17–29.
Berezovsky,I.N. et al. (1999) Amino acid composition of protein termini are biased
in different manners. Protein Eng., 12, 23–30.
Berman,H.M. et al. (2000) The protein data bank. Nucleic Acids Res., 28, 235–242.
Brendel,V. and Karlin,S. (1989) Association of charge clusters with functional domains
of cellular transcription factors. Proc. Natl Acad. Sci. USA, 86, 5698–5702.
Brendel,V. et al. (1992) Methods and algorithms for statistical analysis of protein
sequences. Proc. Natl Acad. Sci. USA, 89, 2002–2006.
Dill,K.A. (1999) Polymer principles and protein folding. Protein Sci., 8, 1166–1180.
Fujinaga,M. et al. (1993) Refined crystal structure of the seryl-tRNA synthetase from
Thermus thermophilus at 2.5s resolution. J. Mol. Biol., 234, 222–233.
Gardner,M.J. et al. (1998) Chromosome 2 sequence of the human malaria parasite
Plasmodium falciparum. Science, 282, 1126–1132.
Glaz,J., Naus,J. and Wallenstein,S. (2001) Scan Statistics. Springer-Verlag, NY,
pp. 45–46.
Harrison,P.M. and Gerstein,M. (2003) A method to assess compositional bias in
biological sequences and its application to prion-like glutamine/asparagine-rich
domains in eukaryotic proteomes. Genome Biol., 4, R40.
Huntley,M. and Golding,G.B. (2000.) Evolution of simple sequence in proteins.
J. Mol. Evol, 51, 131–140.
Huntley,M.A. and Golding,G.B. (2002) Simple sequences are rare in the Protein Data
Bank. Proteins, 48, 134–140.
Karlin,S. and Burge,C. (1996) Trinucleotide repeats and long homopeptides in genes
and proteins associated with nervous system disease and development. Proc. Natl
Acad. Sci. USA, 93, 1560–1565.
Karlin,S. et al. (1990) Identification of significant sequence patterns in proteins.
Methods Enzymol., 183, 388–402.
Karlin,S. et al. (2002) Amino acid runs in eukaryotic proteomes and disease associations. Proc. Natl Acad. Sci. USA, 99, 333–338.
Karlin,S. et al. (2003) Genome comparisons and analysis. Curr. Opin. Struct. Biol., 13,
344–352.
Klein,D.J. et al. (2001) The kink-turn: a new RNA secondary structure motif.
EMBO J., 20, 4214–4221.
Knaus,K.J. et al. (2001) Crystal structure of the human prion protein reveals a
mechanism for oligomerization. Nat. Struct. Biol., 8, 770–774.
Kreil,D.P. and Ouzounis,C.A. (2003) Comparison of sequence masking algorithms
and the detection of biased protein sequence regions. Bioinformatics, 19,
1672–1681.
Lehmann,S. et al. (1999) Trafficking of the cellular isoform of the prion protein.
Biomed. Pharmacother., 53, 39–46.
Li,W. et al. (2002) Tolerating some redundancy significantly speeds up clustering of
large protein databases. Bioinformatics, 8, 77–82.
Nishizawa,M. and Nishizawa,K. (1998) Biased usages of arginines and lysines in
proteins are correlated with local-scale fluctuations of the G + C content of
DNA sequences. J. Mol. Evol., 47, 385–393.
Promponas,V.J. et al. (2000) CAST: an iterative algorithm for the complexity analysis
of sequence tracts. Bioinformatics, 16, 915–922.
Prusiner,S.B. (1998) Prions. Proc. Natl Acad. Sci. USA, 95, 13363–13383.
Silverman,B.D. (2005) Underlying hydrophobic sequence periodicity of protein
tertiary structure. J. Biomol. Struct. Dyn., 22, 411–423.
Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genome wide bias in
the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.
Stevens,J.M. et al. (2004) C-type cytochrome formation: chemical and biological
enigmas. Acc. Chem. Res., 37, 999–1007.
Vriz,S. et al. (1989) Differential expression of two Xenopus c-myc proto-oncogenes
during development. EMBO J., 8, 4091–4097.
Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in
sequence databases. Methods Enzymol., 266, 554–571.
1063