SUPPLEMENTARY MATERIAL
Methodology for the development of profile entries
This supplement outlines how profiles are constructed for the PROSITE database
associated with SWISS-PROT and indicates the steps Provalidator carries out
automatically. Further information is available online, and more details are given in
"Generalized profile syntax for protein and nucleic acid sequence motifs" by P. Bucher.
The value of Provalidator is that it provides a series of instructions that ease
profile creation through automation and allow the direct comparison of profiles
against all available databases.
A profile is a table of position-specific amino acid weights and gap costs. It is used to
calculate a similarity score for any alignment between a profile and a sequence, or parts
of a profile and a sequence. The raw scores are normalized to estimate the statistical
significance of a given score.
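As a rough illustration of the "table of position-specific weights" idea, the sketch below scores ungapped placements of a toy profile along a sequence. The weights, the `DEFAULT_WEIGHT` value and the function names are invented for this example, and gap costs are deliberately left out, so this is a simplification and not the PROSITE scoring scheme.

```python
# Toy profile: one dict of residue weights per match position.
# All numbers are made up for illustration only.
PROFILE = [
    {"A": 3, "G": 1},
    {"C": 5},
    {"D": 2, "E": 2},
]
DEFAULT_WEIGHT = -1  # assumed weight for residues not listed at a position


def window_score(profile, segment):
    """Sum the position-specific weights for one ungapped alignment."""
    return sum(col.get(res, DEFAULT_WEIGHT) for col, res in zip(profile, segment))


def best_ungapped_match(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(profile)
    scores = [
        (window_score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)


print(best_ungapped_match(PROFILE, "MKACDLL"))  # (10, 2)
```

The sequence is assumed to be at least as long as the profile. A real profile search also allows insertions and deletions, priced by the position-specific gap costs.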
The profile structure is similar to but slightly more general than the one introduced by
Gribskov and co-workers (1987). A technical description of the profile structure and of
the corresponding motif search method is given in the file PROFILE.TXT included in
each PROSITE release. The most relevant issues are summarized below.
The generation of a profile requires a multiple sequence alignment as input and uses a
symbol comparison matrix (that is, a weight matrix) to convert residue frequency
distributions into weights. Profiles attempt to characterize a protein family or domain
over its entire length. With a profile covering conserved as well as divergent sequence
regions, there is a chance of obtaining a significant similarity score. This possibility
is taken into account by the quality evaluation procedures. To be acceptable, a
profile must assign high similarity scores to true motif occurrences and low
scores to false matches. In addition, it should correctly align those residues having
analogous functions or structural properties according to experimental data.
Explanation of the most important steps in profile construction
1. Choosing an initial set of trusted sequences
The construction of each new profile entry begins with a set of sequences, either
complete proteins or local homology domains, which have been established to belong to
the same family. It is important not to include a sequence with a doubtful relationship
to the family under consideration, since even a single inappropriate sequence can
severely degrade profile performance. Criteria that we stipulate for establishing the
relationship between the sequences of the starting set include the following:
• Highly significant sequence similarity between all members using pairwise
comparison techniques.
• Knowledge of a common functionality of the sequences in combination with a
reasonable degree of sequence similarity.
• A common three-dimensional fold of the sequences allowing reliable
superposition of the structures, if available.
Provalidator can be fed manually with a number of sequences, as done in this case for
the construction of the RND profile, or can choose sequences by direct retrieval from
databases. More details can be found in Molina-Henares et al. (2009).
2. Construction of a multiple alignment
Provalidator automatically clusters sequences with BLASTCLUST and selects one
sequence from each cluster. A multiple alignment is then generated with ClustalW.
In most cases, analysis of the initial alignment makes it clear that a number of
divergent proteins are present, so that manual refinement of the sequence
regions is necessary. If some of the sequences are very divergent, it has proven
advantageous to exclude these sequences from the initial alignment and add them later,
once a multiple alignment has been generated.
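The "one sequence per cluster" idea can be sketched with a simple greedy clustering. This is not BLASTCLUST's algorithm: the identity measure below assumes equal-length, already comparable strings, and the function names and the 0.8 threshold are assumptions made for this example.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pick_representatives(seqs, threshold=0.8):
    """Greedy clustering: a sequence joins the first cluster whose
    representative it matches at >= threshold identity; otherwise it
    founds a new cluster. Returns one representative name per cluster."""
    reps = []
    for name, seq in seqs.items():
        if not any(identity(seq, seqs[r]) >= threshold for r in reps):
            reps.append(name)
    return reps


toy = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKV",  # 9 of 10 positions identical to s1
    "s3": "MNPQRSTVWY",  # unrelated to s1
}
print(pick_representatives(toy))  # ['s1', 's3']
```

The result depends on input order, since the first member seen becomes the cluster representative.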
3. Calculation of the sequence weights
Provalidator uses the PFMAKE program, which is part of PFTOOLS and available at
the Swiss Institute of Bioinformatics, to automatically generate the profile. Lüthy et al.
(1994) showed that the introduction of sequence weights improves the performance of
the resulting profiles. This effect is particularly pronounced if the initial set of trusted
sequences contains unique sequences, thereby avoiding redundancy. The weighting of
the pre-aligned sequences is established using the algorithm developed by Sibbald and
Argos (1990). Briefly, this algorithm constructs random sequences from the repertoire
of the original sequences and tests which of the proteins in the set is most closely
related to the random sequence. After about 2000*N such trials, the number of hits that
each of the N sequences in the initial set has accumulated is counted. The weighting
factor is then derived from these counts.
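The random-sequence procedure described above can be sketched as follows. Several details are assumptions for illustration: random residues are drawn uniformly from the distinct symbols seen in each alignment column, closeness is measured by the number of mismatches, ties share the hit equally, and the weight is simply the fraction of hits. The function name is hypothetical.

```python
import random


def monte_carlo_weights(alignment, trials_per_seq=2000, seed=0):
    """Estimate sequence weights from hits by random sequences.

    alignment: list of equal-length aligned strings (gaps count as symbols).
    """
    rng = random.Random(seed)
    n = len(alignment)
    length = len(alignment[0])
    # Distinct symbols observed in each column of the alignment.
    choices = [sorted({seq[i] for seq in alignment}) for i in range(length)]
    hits = [0.0] * n
    trials = trials_per_seq * n  # about 2000*N trials, as in the text
    for _ in range(trials):
        rand_seq = [rng.choice(c) for c in choices]
        dists = [sum(a != b for a, b in zip(seq, rand_seq)) for seq in alignment]
        best = min(dists)
        winners = [i for i, d in enumerate(dists) if d == best]
        for i in winners:
            hits[i] += 1.0 / len(winners)  # ties share the hit
    return [h / trials for h in hits]


weights = monte_carlo_weights(["ACDE", "ACDE", "GHIK"])
print(weights)
```

In this toy alignment the two identical sequences split the hits that fall near them, so each receives a lower weight than the single divergent sequence.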
4. Construction of a generalised profile from a weighted alignment
The generalised profile syntax has been described before (Bucher and Bairoch, 1994). In
abstract terms, a generalised profile can be described as an alternating sequence of
'match' and 'insert' positions. The match positions correspond to residues which
typically occur in such a sequence, whereas the insert positions represent places where
additional residues may optionally be inserted. Match and insert positions contain
complementary sets of numeric parameters called profile scores. The values assigned to
these parameters are often different at each position. While the mathematical structure
of a profile is that of a two-dimensional table of numbers, a profile may also be viewed
as a degenerate molecular sequence.
Provalidator automatically converts weighted multiple alignments into generalised
profiles by employing the PFMAKE program, which takes advantage of the advanced
features of the generalised profiles (Sibbald and Argos, 1990). In the default procedure,
Provalidator uses a 10*log10-scaled version of the BLOSUM45 comparison matrix
(Henikoff and Henikoff, 1992), applying symmetrical gap-opening and gap-closing
penalties of 1.05 each, and a gap-extension penalty of 0.21. If no special requirements
exist, we use limited gap-excision to exclude from the profile those regions of the
alignment that are present in fewer than 50% of the sequences. These gap-excisions are
compensated for by lowering the insert penalties at the excision boundaries, depending
on the number of excised residues.
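Two parts of this step lend themselves to a small sketch: dropping alignment columns occupied in fewer than 50% of the sequences, and computing weighted residue frequencies for a column. The function names are hypothetical and this is not PFMAKE's implementation. The conversion of frequencies into scores with the scaled BLOSUM45 matrix and the compensation of insert penalties are not shown.

```python
def excise_sparse_columns(alignment, min_occupancy=0.5, gap="-"):
    """Keep columns where at least min_occupancy of sequences have a residue.

    Returns (kept column indices, trimmed alignment).
    """
    n = len(alignment)
    keep = [
        i for i in range(len(alignment[0]))
        if sum(seq[i] != gap for seq in alignment) / n >= min_occupancy
    ]
    trimmed = ["".join(seq[i] for i in keep) for seq in alignment]
    return keep, trimmed


def weighted_frequencies(column, weights, gap="-"):
    """Weighted residue frequencies for one column, ignoring gaps."""
    totals = {}
    for res, w in zip(column, weights):
        if res != gap:
            totals[res] = totals.get(res, 0.0) + w
    norm = sum(totals.values())
    return {res: t / norm for res, t in totals.items()}


aln = ["AC-DE", "ACQDE", "A--DE", "G--DK"]
keep, trimmed = excise_sparse_columns(aln)
print(keep)     # [0, 1, 3, 4]
print(trimmed)  # ['ACDE', 'ACDE', 'A-DE', 'G-DK']
print(weighted_frequencies("AAAG", [1, 1, 1, 3]))  # {'A': 0.5, 'G': 0.5}
```

The last line shows the effect of sequence weights: a single divergent sequence with weight 3 counts as much as three similar sequences with weight 1 each.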
Depending on the purpose of the profile (i.e. if it reflects a complete protein family or a
localised homology domain that is part of larger sequences) we can force it to favour
local or global alignment behaviour, or any intermediate thereof.
5. Estimating the statistical significance of profile matches
The function of a profile is to align itself to a real sequence and to assign a number to
such an alignment. This number is called the similarity score or alignment score and
serves to evaluate the significance of a potential motif occurrence.
Like most similarity search techniques, a protein database search with a profile returns a
sorted list of potential matches ranked by a quality score. Because there is no statistical
theory that allows for direct computation of the probability of obtaining a certain score
by chance, one has to rely on empirical methods for significance estimation. Such
methods typically attempt to fit the parameters of a mathematical function to the score
distribution of chance matches found in real or random sequences. If random sequences
are used, it is important that the sequences are generated with a procedure that preserves
certain statistical properties of biological sequences known to have an influence on the
score distribution such as compositional bias and the actual length distribution.
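One way to generate random sequences that keep both the length and the local composition of a real sequence is to shuffle residues only within successive windows. The sketch below assumes this windowed shuffle; the function name and the `seed` parameter are choices made for the example, with a default window of 20 residues.

```python
import random


def shuffle_in_windows(sequence, window=20, seed=None):
    """Shuffle residues inside successive windows of the given size.

    Length and per-window amino acid composition are preserved.
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, len(sequence), window):
        chunk = list(sequence[start:start + window])
        rng.shuffle(chunk)
        out.append("".join(chunk))
    return "".join(out)


original = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT"
shuffled = shuffle_in_windows(original, window=20, seed=1)
print(shuffled)
```

Because residues never leave their window, compositionally biased regions stay biased in the shuffled sequence.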
The specific method Provalidator uses for significance tests is based on a regionally
shuffled version of SWISS-PROT sequences that preserves the original length
distribution and the amino acid composition in successive windows of 20 residues
(Pearson and Lipman, 1988). Each profile is compared to this random database to
produce a list of high-scoring profile matches sorted by score. The score distribution is
then analysed by plotting the logarithm of the number of observed matches above a
given score against the score itself. Such a plot typically shows an approximately linear
relationship between these two variables, which would be expected for an extreme value
distribution:
In this distribution, NDB is the number of residues in the database, while the
parameters a and b are estimated by linear regression analysis and used to calculate a
normalised score. The purpose of normalization instructions is to convert the raw score
into directly interpretable units.
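One form consistent with these definitions is given here as an assumption, not as the authors' exact expression. It writes N(x) for the number of chance matches with raw score at least x, and N_DB for the quantity called NDB in the text:

```latex
\log_{10} N(x) = \log_{10} N_{DB} - (a + b\,x),
\qquad
N_{\mathrm{score}}(x) = a + b\,x
```

Under this form the normalised score is linear in the raw score, and the logarithm of the expected number of chance matches is obtained by combining it with the logarithm of the database size.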
Note that a and b are characteristic parameters of a profile which need to be
re-estimated whenever a profile is modified. The probability of finding a match with a
given score in a database of a given size can be computed from the normalised score by
subtracting the logarithm of the number of residues found in the database. This value is
referred to as the P-value and is used as a significance estimate for high-scoring
profile matches in the real protein sequence database.
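The regression step can be sketched in a few lines. The sketch assumes the linear form log10(number of matches scoring at least x) = log10(NDB) - (a + b*x), which is an assumption consistent with the text and not a formula taken from it. It also assumes that the rank of a score in the sorted list approximates the number of matches at or above it. All function names are hypothetical.

```python
import math


def fit_normalization(raw_scores, n_db):
    """Least-squares estimate of (a, b) from chance-match raw scores."""
    xs = sorted(raw_scores, reverse=True)
    # Rank k (1-based) of a sorted score ~ number of matches scoring >= it.
    ys = [math.log10(k) for k in range(1, len(xs) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # y = intercept + slope * x  and  y = log10(n_db) - a - b * x
    b = -slope
    a = math.log10(n_db) - intercept
    return a, b


def normalized_score(raw, a, b):
    """Convert a raw score into a normalised score."""
    return a + b * raw


def significance(norm_score, n_db):
    """Normalised score minus log10 of the database size, as in the text."""
    return norm_score - math.log10(n_db)


# Synthetic scores lying exactly on the assumed line with a=2, b=0.01.
scores = [(4 - math.log10(k)) / 0.01 for k in range(1, 51)]
a, b = fit_normalization(scores, n_db=1e6)
print(round(a, 6), round(b, 6))
print(significance(8.5, 1e6))
```

With real chance-match scores the fit would only be approximate, and a and b would need to be refitted after every change to the profile.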
Normalization functions are required to preserve the ranking of scores for alternative
alignments between the same profile and the same sequence. However, since
normalization functions may depend on sequence parameters such as length and residue
composition, they will generally not preserve the order of scores for matches from
different versions of sequences for the same genes encountered during a database search.
6. Database search
Provalidator automatically searches UNIPROT with a profile (in this case the RND
profile). The output is a list of normalized scores. It has been empirically determined
that a normalized-score cut-off of N ≥ 8.5 has discriminatory character.
The function of a cut-off value is to exclude a priori the majority of the less relevant
alignments from further consideration by a profile search algorithm. The fate of the
remaining alignments with similarity scores greater than or equal to the cut-off value
depends on the specific disjointness definition applied. An important aspect of a cut-off
value is that it gives a qualitative meaning to a profile. This is a prerequisite for
statistics on false positives and false negatives obtained in a database search, as
currently provided by PROSITE.
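Applying the cut-off to a list of search results amounts to a simple filter. In the sketch below the accession-like names and scores are invented, and the hit format (name, normalized score) is an assumption, not Provalidator's actual output format.

```python
CUTOFF = 8.5  # normalized-score cut-off quoted in the text


def apply_cutoff(hits, cutoff=CUTOFF):
    """Keep hits scoring >= cutoff, sorted from best to worst."""
    kept = [(name, score) for name, score in hits if score >= cutoff]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Hypothetical search output: (sequence name, normalized score).
hits = [("seqA", 7.9), ("seqB", 12.4), ("seqC", 8.5), ("seqD", 3.1)]
print(apply_cutoff(hits))  # [('seqB', 12.4), ('seqC', 8.5)]
```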
References
Bucher P., Bairoch A. (1994) A generalized profile syntax for biomolecular
sequence motifs and its function in automatic sequence interpretation. In: Proc.
ISMB-94, pp. 53-61, AAAI/MIT Press.
Gribskov M., McLachlan A.D., Eisenberg D. (1987) Profile analysis: Detection of
distantly related proteins. Proc. Natl. Acad. Sci. USA 84:4355-4358.
Henikoff S., Henikoff J.G. (1992) Amino acid substitution matrices from protein
blocks. Proc. Natl. Acad. Sci. USA 89:10915-10919.
Lüthy R., Xenarios I., Bucher P. (1994) Improving the sensitivity of the sequence
profile method. Prot. Sci. 3:139-146.
Molina-Henares A.J., Godoy P., Duque E., Ramos J.L. (2009) A general profile
for the MerR family of transcriptional regulators constructed using the semi-automated
Provalidator tool. Environmental Microbiology Reports DOI: 1111/j.1758-2229.
Pearson W.R., Lipman D.J. (1988) Improved tools for biological sequence
comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.
Sibbald P., Argos P. (1990) Weighting aligned protein or nucleic acid sequences to
correct for unequal representation. J. Mol. Biol. 216:813-818.