
SUPPLEMENTARY MATERIAL

Methodology for the development of profile entries

The source information outlines how profiles are constructed for the PROSITE database of SWISS-PROT and indicates which steps Provalidator carries out automatically. The relevant information can be downloaded online, and further details are given in "Generalized profile syntax for protein and nucleic acid sequence motifs" by P. Bucher. The value of Provalidator is that it provides a series of instructions that ease profile creation through automation and that allow the direct comparison of profiles in all available databases.

A profile is a table of position-specific amino acid weights and gap costs. It is used to calculate a similarity score for any alignment between a profile and a sequence, or between parts of a profile and a sequence. The raw scores are normalized to estimate the statistical significance of a given score. The profile structure is similar to, but slightly more general than, the one introduced by Gribskov and co-workers (1987). A technical description of the profile structure and of the corresponding motif search method is given in the file PROFILE.TXT included in each PROSITE release. The most relevant issues are summarized below.

The generation of a profile requires a multiple sequence alignment as input and uses a symbol comparison matrix (that is, a weight matrix) to convert residue frequency distributions into weights. Profiles attempt to characterize a protein family or domain over its entire length. With a profile covering conserved as well as divergent sequence regions, there is a chance of obtaining a significant similarity score. This possibility is taken into account by the quality evaluation procedures.
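As a rough illustration of how a comparison matrix can turn the residue frequencies of one alignment column into position-specific weights, the sketch below uses the average-score rule of Gribskov-style profiles: the weight of residue a at a position is the frequency-weighted mean of its matrix scores against the residues observed in that column. The three-letter alphabet, the matrix values, and all function names are hypothetical and chosen only for illustration. Gap costs are ignored, and the computation performed by the actual profile-building software is more elaborate.

```python
from collections import defaultdict

# Toy symmetric comparison matrix over a three-letter alphabet.
# Hypothetical values for illustration only; not BLOSUM45 or any published matrix.
ALPHABET = ("A", "G", "W")
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}

def pair_score(a, b):
    """Look up the comparison score for a residue pair in either order."""
    return TOY_MATRIX.get((a, b), TOY_MATRIX.get((b, a)))

def column_frequencies(column, seq_weights):
    """Weighted residue frequencies of one alignment column; gaps ('-') are skipped."""
    totals = defaultdict(float)
    for residue, weight in zip(column, seq_weights):
        if residue != "-":
            totals[residue] += weight
    norm = sum(totals.values())
    return {r: t / norm for r, t in totals.items()} if norm else {}

def column_weights(column, seq_weights):
    """Average-score rule: weight(a) = sum over b of freq(b) * matrix(a, b)."""
    freqs = column_frequencies(column, seq_weights)
    return {a: sum(f * pair_score(a, b) for b, f in freqs.items()) for a in ALPHABET}

def build_match_table(alignment, seq_weights):
    """One dictionary of residue weights per alignment column."""
    return [column_weights(col, seq_weights) for col in zip(*alignment)]

def score_ungapped(table, sequence):
    """Raw similarity score of an ungapped, full-length alignment."""
    return sum(position[residue] for position, residue in zip(table, sequence))

table = build_match_table(["AGW", "AGW", "GGW"], seq_weights=[1.0, 1.0, 2.0])
raw_score = score_ungapped(table, "AGW")
```

Here the third sequence carries twice the weight of the others, so the first column is treated as half A and half G even though A occurs in two of the three sequences.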
In order to be acceptable, a profile must assign high similarity scores to true motif occurrences and low scores to false matches. In addition, it should correctly align those residues that have analogous functions or structural properties according to experimental data.

Explanation of the most important steps in profile construction

1. Choosing an initial set of trusted sequences

The construction of each new profile entry begins with a set of sequences, either entire proteins or local homology domains, which have been established to belong to the same family. It is important not to include a sequence with a doubtful relationship to the family under consideration, since even a single inappropriate sequence can severely degrade profile performance. The criteria that we stipulate for establishing the relationship between the sequences of the starting set include the following:

- Highly significant sequence similarity between all members using pairwise comparison techniques.
- Knowledge of a common functionality of the sequences in combination with a reasonable degree of sequence similarity.
- A common three-dimensional fold of the sequences allowing reliable superposition of the structures, if available.

Provalidator can be fed manually with a number of sequences, as was done in this case for the construction of the RND profile, or it can choose sequences by direct retrieval from databases. More details can be found in Molina-Henares et al. (2009).

2. Construction of a multiple alignment

Provalidator automatically clusters sequences with BLASTCLUST and selects one sequence from each cluster. A multiple alignment of the selected sequences is then generated with ClustalW. In most cases, analysis of the initial alignment reveals a number of divergent proteins, so that manual refinement of the sequence regions is necessary.
If some of the sequences are very divergent, it has proven advantageous to exclude these sequences from the initial alignment and to add them later, once a multiple alignment has been generated.

3. Calculation of the sequence weights

Provalidator uses the PFMAKE program, which is part of PFTOOLS and available from the Swiss Institute of Bioinformatics, to generate the profile automatically. Lüthy et al. (1994) showed that the introduction of sequence weights improves the performance of the resulting profiles. This effect is particularly pronounced if the initial set of trusted sequences contains unique sequences, thereby avoiding redundancy. The weighting of the pre-aligned sequences is established using the algorithm developed by Sibbald and Argos (1990). Briefly, this algorithm constructs random sequences from the repertoire of the original sequences and tests which of the proteins in the set is most closely related to each random sequence. After about 2000*N such trials, the number of hits that each of the N sequences in the initial set has accumulated is counted. The weighting factor is then derived from these counts.

4. Construction of a generalised profile from a weighted alignment

The generalised profile syntax has been described before (Bucher and Bairoch, 1994). In abstract terms, a generalized profile can be described as an alternating sequence of 'match' and 'insert' positions. The match positions correspond to residues that typically occur in such a sequence, whereas the insert positions represent places where additional residues may optionally be inserted. Match and insert positions contain complementary sets of numeric parameters called profile scores. The values assigned to these parameters are often different at each position. While the mathematical structure of a profile is that of a two-dimensional table of numbers, a profile may also be viewed as a degenerate molecular sequence.
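The Monte Carlo weighting idea summarized in step 3 can be sketched as follows. This is a simplified illustration in the spirit of the procedure, not the PFMAKE implementation. It assumes that each random sequence is built by drawing, at every column, uniformly from the distinct symbols observed in that column, that "most closely related" means the smallest number of mismatches, and that ties share the hit equally. The function name and these choices are assumptions.

```python
import random

def monte_carlo_weights(alignment, trials_per_sequence=2000, seed=0):
    """Estimate sequence weights by counting how often each aligned sequence
    is the nearest neighbour of a randomly generated probe sequence."""
    rng = random.Random(seed)
    n = len(alignment)
    # Distinct symbols seen in each column: the repertoire the probes are drawn from.
    # Gap characters are treated like any other symbol in this sketch.
    repertoire = [sorted(set(column)) for column in zip(*alignment)]
    hits = [0.0] * n
    for _ in range(trials_per_sequence * n):
        probe = [rng.choice(symbols) for symbols in repertoire]
        # Mismatch count between the probe and every aligned sequence.
        distances = [sum(a != b for a, b in zip(seq, probe)) for seq in alignment]
        best = min(distances)
        nearest = [i for i, d in enumerate(distances) if d == best]
        for i in nearest:
            hits[i] += 1.0 / len(nearest)  # ties share the hit equally
    total = sum(hits)
    return [h / total for h in hits]

# Two identical sequences and one distinct sequence.
weights = monte_carlo_weights(["AAAA", "AAAA", "WWWW"])
```

In this toy input the two identical sequences split the probes that fall nearest to them, so each receives a smaller weight than the single distinct sequence. This is the down-weighting of redundant sequences that the procedure is meant to achieve.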
Provalidator automatically converts weighted multiple alignments into generalised profiles by employing the PFMAKE program, which takes advantage of the advanced features of the generalised profiles (Sibbald and Argos, 1990). In the default procedure, Provalidator uses a 10*log10-scaled version of the BLOSUM45 comparison matrix (Henikoff and Henikoff, 1992), applying symmetrical gap-opening and gap-closing penalties of 1.05 each and a gap-extension penalty of 0.21. If no special requirements exist, we use limited gap-excision to exclude from the profile those regions of the alignment that are present in fewer than 50% of the sequences. These gap-excisions are compensated for by lowering the insert penalties at the excision boundaries, depending on the number of excised residues. Depending on the purpose of the profile (i.e. whether it reflects a complete protein family or a localised homology domain that is part of larger sequences), we can force it to favour local or global alignment behaviour, or any intermediate between the two.

5. Estimating the statistical significance of profile matches

The function of a profile is to align itself to a real sequence and to assign a number to such an alignment. This number is called the similarity score or alignment score and serves to evaluate the significance of a potential motif occurrence. Like most similarity search techniques, a protein database search with a profile returns a list of potential matches ranked by a quality score. Because there is no statistical theory that allows direct computation of the probability of obtaining a certain score by chance, one has to rely on empirical methods for significance estimation. Such methods typically attempt to fit the parameters of a mathematical function to the score distribution of chance matches found in real or random sequences.
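A minimal sketch of one such empirical fit is given here, assuming that the number of chance matches at or above a raw score x falls off log-linearly, log10(count) = log10(NDB) - (a + b*x), where NDB is the size of the searched database in residues. Under that assumption a + b*x acts as a normalised score. The functional form, the sign conventions, and all function names are assumptions made for illustration and are not taken from the Provalidator or PFTOOLS code.

```python
import math

def tail_points(chance_scores):
    """For every distinct score x, return (x, log10 of the number of scores >= x)."""
    points = {}
    for rank, score in enumerate(sorted(chance_scores, reverse=True), start=1):
        points[score] = math.log10(rank)  # the last rank seen for x counts all scores >= x
    return sorted(points.items())

def least_squares(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def normalization_parameters(chance_scores, ndb):
    """Estimate a and b under the assumed form log10(count) = log10(ndb) - (a + b*x)."""
    xs, ys = zip(*tail_points(chance_scores))
    intercept, slope = least_squares(xs, ys)
    return math.log10(ndb) - intercept, -slope

def normalized_score(raw_score, a, b):
    return a + b * raw_score

def log10_expected_matches(norm_score, ndb):
    """Assumed relation: log10 of expected chance matches in a database of ndb residues."""
    return math.log10(ndb) - norm_score

# Synthetic chance scores whose tail counts are exactly 1, 10 and 100.
chance = [3.0] + [2.0] * 9 + [1.0] * 90
a, b = normalization_parameters(chance, ndb=10**5)
```

With these synthetic scores the fitted line is log10(count) = 3 - x, so a raw score of 3 maps to a normalised score of 5, which in a database of 10^5 residues corresponds to one expected chance match.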
If random sequences are used, it is important that they are generated by a procedure that preserves those statistical properties of biological sequences known to influence the score distribution, such as compositional bias and the actual length distribution. The specific method Provalidator uses for significance tests is based on a regionally shuffled version of the SWISS-PROT sequences, which preserves the original length distribution and the amino acid composition in successive windows of 20 residues (Pearson and Lipman, 1988). Each profile is compared to this random database to produce a list of high-scoring profile matches sorted by score. The score distribution is then analysed by plotting the logarithm of the number of observed matches above a given score against the score itself. Such a plot typically shows an approximately linear relationship between these two variables, as would be expected for an extreme value distribution. In this distribution, NDB is the number of residues in the database, while the parameters a and b are estimated by linear regression analysis and used to calculate a normalised score. The purpose of the normalization instructions is to convert the raw score into directly interpretable units. Note that a and b are characteristic parameters of a profile and need to be re-estimated whenever the profile is modified. The probability of finding a match with a given score in a database of a given size can be computed from the normalised score by subtracting the logarithm of the number of residues found in the database. This value is referred to as the P-value and is used as a significance estimate for high-scoring profile matches in the real protein sequence database.

Normalization functions are required to preserve the ranking of scores for alternative alignments between the same profile and the same sequence.
However, since normalization functions may depend on sequence parameters such as length and residue composition, they will generally not preserve the order of scores for matches from different versions of sequences for the same genes encountered during a database search.

6. Database search

Provalidator automatically compares a profile (in this case the RND profile) against UNIPROT. The output is a list of normalized scores. It has been empirically determined that a cut-off normalized score of N ≥ 8.5 has discriminatory character. The function of a cut-off value is to exclude a priori the majority of the less relevant alignments from further consideration by a profile search algorithm. The fate of the remaining alignments, with similarity scores greater than or equal to the cut-off value, depends on the specific disjointness definition applied. An important aspect of a cut-off value is that it gives a qualitative meaning to a profile. This is a prerequisite for statistics on false positives and false negatives obtained in a database search, as currently provided by PROSITE.

References

Bucher P., Bairoch A. (1994) A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In: Proc. ISMB-94, pp. 53-61, AAAI/MIT Press.

Gribskov M., McLachlan A.D., Eisenberg D. (1987) Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84:4355-4358.

Henikoff S., Henikoff J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89:10915-10919.

Lüthy R., Xenarios I., Bucher P. (1994) Improving the sensitivity of the sequence profile method. Prot. Sci. 3:139-146.

Molina-Henares A.J., Godoy P., Duque E., Ramos J.L. (2009) A general profile for the MerR family of transcriptional regulators constructed using the semi-automated Provalidator tool. Environmental Microbiology Reports DOI: 1111/j.1758-2229.

Pearson W.R., Lipman D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.

Sibbald P., Argos P. (1990) Weighting aligned protein or nucleic acid sequences to correct for unequal representation. J. Mol. Biol. 216:813-818.
