Unit 12: Bioinformatics

12.4 Analysis methods

We have seen how biological sequences have been annotated and stored in databases, and how their entries can be linked to allow cross-database searching. Let us now think about some of the analysis methods used to find patterns and regularities in data-sets. This topic guide will look, in particular, at methods that help to derive database annotations, especially those used to assign structural and functional information to uncharacterised sequences.

On successful completion of this topic you will:
• understand the processes of computational biology (LO2).

To achieve a Pass in this unit you need to show that you can:
• discuss computational processes for the collection and manipulation of data (2.1)
• discuss the need to extract only the information relevant to a specific biological question (2.2).

1 Computational methods for finding patterns and regularities in data

Increasingly, bioinformaticians have to analyse ‘raw’ (i.e., unannotated) sequences to try to shed light on their biological roles and evolutionary relationships. Sequences can be analysed using a range of methods, at various levels of specificity. The methods often work by identifying ‘tell-tale’ sequence patterns and regularities. Let us now consider how the tell-tale traits that protein sequences share can be analysed at different levels of sophistication, and what we can learn from them.

Key terms
Solvent accessibility: A measure of the likelihood of components of a biomolecule, such as a protein, to be available to interact with the surrounding solvent.
Algorithm: A series of instructions set out in a stepwise way to allow computers to perform calculations.
Hydrophobicity: The propensity of a biochemical entity to avoid water.
Hydrophilicity: The propensity of a biochemical entity to display an affinity for water.
Individual sequences

A simple sort of analysis is to identify patterns that typify localised functional or structural features (transmembrane (TM) domains, solvent-accessible regions, etc.). One method for detecting such features is to create a graph using what is known as a ‘sliding-window’ algorithm: this divides the sequence into chunks, and plots average values of a given amino-acid property for each chunk, providing a rapid visual overview of the entire sequence.

Consider the property of hydrophobicity. We take a window the size of the feature being investigated (~20 amino acids for a membrane-spanning helix). Starting from the N terminus, the hydrophobicity values of the residues that lie in the window are averaged; the window then moves stepwise through the sequence, one residue at a time, and the calculation is repeated until the C terminus is reached.

Figure 12.4.1: Using a hydrophobicity profile to pinpoint likely TM domains. [Plot: window (property) score on the y axis, from 0 to 6, against position in the query sequence on the x axis, from 0 to 400.]

The results are plotted in a 2D graph in which the query sequence lies along the x axis, and the hydrophobicity score along the y axis. The graph is characterised by peaks and troughs (see Figure 12.4.1) corresponding to the most hydrophobic and most hydrophilic parts of the sequence. Significance thresholds are often used to pinpoint possible TM domains: in Figure 12.4.1, the unbroken and broken lines mark the upper and lower significance bounds. From the figure, we can infer that the most hydrophobic peaks (those above the upper threshold) are likely to correspond to TM domains.

Link
Find out more about the properties of amino acids in Unit 1: Biochemistry of macromolecules and metabolic pathways.

Activity
In the hydrophobicity profile shown in Figure 12.4.1, how many potential TM domains have been pinpointed? Give your reasoning.
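The sliding-window calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used to produce Figure 12.4.1: the guide does not say which hydrophobicity scale was used, so the Kyte-Doolittle hydropathy values are assumed here, and the names `hydrophobicity_profile` and `windows_above`, the toy sequence and the threshold are all hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed scale; the guide does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence, window=20):
    """Average the hydropathy of each window, moving one residue at a time
    from the N terminus to the C terminus."""
    if not 1 <= window <= len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


def windows_above(profile, threshold):
    """Return the start positions of windows whose average exceeds the threshold."""
    return [start for start, score in enumerate(profile) if score > threshold]


# Toy sequence: a hydrophobic stretch followed by a hydrophilic one.
profile = hydrophobicity_profile("IIIIDDDD", window=4)
print(profile)
print(windows_above(profile, threshold=1.6))
```

Plotting the returned values against window position would give a profile of the kind shown in Figure 12.4.1. The window of 4 and the threshold of 1.6 are arbitrary choices that suit the toy sequence; for a membrane-spanning helix the window would be about 20 residues, as noted in the text.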
Pairs of sequences

We can visualise the similarity shared by two sequences using what is known as a dotplot. This is a table where the sequences lie on the horizontal and vertical axes, and the table cells contain the scores of all their pairwise residue comparisons. Consider the first row of the table shown in Figure 12.4.2. If we compare the first residue of the vertical sequence to every residue in the horizontal sequence, there is only one position where the residues are the same: the first. An ‘x’ is therefore marked at this spot. This comparison process is repeated for every residue of the vertical sequence, giving rise to a table in which each identical residue match is marked with an x.

Key terms
Unitary matrix: Also known as an identity matrix, a table that records a score of 1 for all amino acid (or nucleotide) self-substitutions and 0 for all non-self-substitutions.
Substitution matrix: A table that encodes the evolutionary exchange rates of each of the amino acid residues for each other in the form of a set of pairwise probability scores.

Figure 12.4.2: Pairwise alignment of two short sequences of unequal length visualised using a dotplot. Using a unitary matrix, the alignment score is 17.

If identical matches score 1 and non-identities score 0, we are using a ‘unitary matrix’ (to quantify similarity more sensitively, we use substitution matrices, which also give positive scores to exchanges between amino acids with similar properties). For identical sequences, dotplots show an unbroken diagonal line; for similar sequences, the main diagonal is broken (the blue line in Figure 12.4.2), and the interrupted regions pinpoint residue differences. Diagonals that lie off the main diagonal denote the presence of internal sequence repeats (the grey lines in Figure 12.4.2). The lengths of the main diagonal and of the off-diagonal lines denote the spans of identical residues that the two sequences share.
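The construction of a dotplot with a unitary matrix can be sketched as follows. This is an illustrative sketch only: the two sequences are those shown in Figure 12.4.2, but which of them lies on the vertical axis is an assumption, and the function names `dotplot` and `render` are hypothetical.

```python
def dotplot(vertical, horizontal):
    """Score every residue pair with a unitary matrix: 1 if identical, else 0."""
    return [[1 if v == h else 0 for h in horizontal] for v in vertical]


def render(vertical, horizontal):
    """Draw the table as text, marking each identical pair with an 'x'."""
    grid = dotplot(vertical, horizontal)
    lines = ["  " + " ".join(horizontal)]
    for residue, row in zip(vertical, grid):
        cells = " ".join("x" if cell else "." for cell in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


# Sequences from Figure 12.4.2 (assignment to the axes is assumed).
seq_vertical = "CTYFPHFELGHGSAQRGHG"
seq_horizontal = "CTYFPHFSDLGHGSAQKGHG"

grid = dotplot(seq_vertical, seq_horizontal)
print(render(seq_vertical, seq_horizontal))
```

In the printed table, runs of x along the main diagonal correspond to stretches of identical residues, and shorter diagonal runs away from it correspond to internal repeats, as described in the text.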
Another way to compare two sequences is to align them, one below the other. If they differ in length, ‘gap’ characters must be introduced to bring them into vertical register. Figure 12.4.2 illustrates an alignment of two short sequences, scored with a unitary matrix and visualised with a dotplot.

    CTYFPHF–ELGHGSAQRGHG
    CTYFPHFSDLGHGSAQKGHG    = 17

[Figure 12.4.2 dotplot axes: CTYFPHFELGHGSAQRGHG and CTYFPHFSDLGHGSAQKGHG.]

Key terms
Dynamic programming: A computer programming method that breaks down large, usually very difficult, problems into smaller sub-problems that can be solved, and combines the results to give an overall solution.
Heuristic algorithm: An algorithm that approximates the solution to a complex problem by solving a simpler problem, producing a good, but not necessarily the best, solution.
p-value: In a database search, the probability of there being a match with a score greater than or equal to that of a retrieved match, relative to the scores expected when comparing random sequences of the same length and composition as the query to the database.
e-value: In a database search, the number of matches with scores greater than or equal to that of the retrieved match that are expected to occur by chance in a database of the same size and composition, using the same scoring system.
Motif: An un-gapped, conserved region of an alignment, typically 10–20 amino acids in length, often denoting a key structural or functional feature.

Link
Find out more about the concept of probability in Unit 10: Statistics for experimental design.

Activity
Examine the dotplot illustrated in Figure 12.4.2.
• What are the lengths of the two sequences?
• Why is their alignment score 17?
• What is the length and what is the amino acid sequence of the internal repeat?

The best alignment between two sequences can be calculated using ‘dynamic programming’ methods.
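As a rough illustration of the dynamic programming idea, the sketch below fills in a table of best scores for every pair of sequence prefixes, in the style of a Needleman-Wunsch global alignment. The guide does not name a specific algorithm or gap penalty, so the function name `best_alignment_score` and the default scoring (1 for a match, 0 for a mismatch, 0 for a gap, i.e. a unitary matrix with free gaps) are assumptions made for the example.

```python
def best_alignment_score(a, b, match=1, mismatch=0, gap=0):
    """Return the best global alignment score of sequences a and b.

    score[i][j] holds the best score for aligning the first i residues
    of a with the first j residues of b. `gap` is added to the score for
    every gap character, so a penalty is written as a negative number.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a prefix of a aligned only to gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # a prefix of b aligned only to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# The two sequences of Figure 12.4.2, scored with the assumed defaults.
print(best_alignment_score("CTYFPHFELGHGSAQRGHG", "CTYFPHFSDLGHGSAQKGHG"))
```

With these assumed settings the sketch reproduces the score of 17 reported for Figure 12.4.2. The two nested loops visit every cell of the table once, which is why the running time grows with the product of the two sequence lengths.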
As the time taken by dynamic programming is proportional to the product of the sequence lengths, these methods do not scale for large numbers of sequences, such as those found in databases. Instead, approximate or ‘heuristic’ algorithms are used to find good (but not necessarily the best) alignments in reasonable time frames. In this way, pairwise algorithms have been optimised for database searching, and provide statistical estimates (p-values and e-values) to give measures of confidence in their results: the p-value gives the probability that a match with an equal or better score than that of the retrieved match would arise by chance, while the e-value denotes the number of random matches expected with equal or better scores than that of the retrieved match. The closer these values are to 0, the more significant the matches are.

Multiple sequences

Aligning multiple sequences is, arguably, more useful than just looking at pairs of sequences, because conserved traits become visible in a multiple alignment. Various alignment-based methods have been developed to help identify conserved functional and structural traits (binding sites, active sites, domains, etc.), and to determine their biological significance. The methods vary both in how they use patterns of similarity to identify such features, and in their diagnostic performance. As illustrated in Figure 12.4.3, there are three main approaches, depending on whether the methods use:
• single motifs (for example, encoded as consensus expressions)
• multiple motifs (for example, encoded as fingerprints), or
• complete domains (for example, encoded as profiles).

Figure 12.4.3: Alignment-based pattern-recognition methods. These methods fall into three main classes, depending on whether they use single motifs, multiple motifs or full domains. [Panels: single-motif methods (e.g. PROSITE); multiple-motif methods (e.g. PRINTS); domain-based methods (e.g. Pfam).]

i) Consensus expressions are the simplest, because they distil the residue conservation information from regions of the full alignment into a single consensus sequence or ‘pattern’. Here, residues at each position of a single motif are examined. If a position is conserved (i.e., has the same residue in all sequences in the alignment), that residue is noted; if several similar residues occur, these are grouped; if residues share no similarity, an x is given. These observations are used to create a consensus expression summarising the residue conservation at each motif position: for example, D-[IVL]-V-[FY]-x-[KRH]-Q. This expression shows:
• three conserved residues (D, V, Q)
• three positions where residues share similar properties ([IVL], [FY], [KRH])
• one non-conserved position (x).
Expressions like this are used to search databases for sequences that share the same pattern of residue conservation; they work best in diagnosing small functional sites, and are the basis of the PROSITE database.

Activity
Examine the consensus expression described above.
• Consider the sequences DVVYAKQ and DIVFGRN. Do the sequences match the expression? Explain your reasoning.

ii) Fingerprints combine groups of motifs in sets of aligned sequences to create more powerful diagnostic signatures: the more motifs they include, the more specific fingerprints become. Residue information within motifs is scored with a simple unitary matrix. Fingerprints are used to search databases for sequences that contain the same sets of conserved motifs; they work best at diagnosing members of closely related families and subfamilies, and are the basis of the PRINTS database.

iii) Profiles are more complex, because they encapsulate complete domains (i.e., including regions with insertions and deletions) and use substitution matrices to score amino acid comparisons.
Each alignment position is scored according to whether it is conserved, or whether a residue has been inserted or deleted (large insertions and deletions are penalised with negative scores). Profiles work best at diagnosing divergent members of superfamilies; they were introduced into PROSITE to offer diagnostic alternatives for some of its weaker consensus expressions. Hidden Markov models (HMMs) are like profiles but use probabilities to denote whether residue positions are conserved, or whether the positions include insertions or deletions. Like profiles, HMMs are best at diagnosing members of superfamilies; they are the basis of the Pfam database.

Take it further
Find out more about sequence analysis methods and the need for critical analysis in Introduction to Bioinformatics (Attwood and Parry-Smith, 1999, Prentice Hall) and in Bioinformatics and Molecular Evolution (Higgs and Attwood, 2005, Wiley-Blackwell).

2 Methods for determining accuracy and statistical significance

The methods described in the previous section aim to identify functionally or structurally related sequences. The challenge is to determine:
i) which of the sequences matched are truly related (true-positive)
ii) which sequences are unrelated (true-negative).

Figure 12.4.4: Resolving true from false matches. [Plot: number of matches on the y axis.]

The results of database searches often contain a mixture of true and false matches, at a given scoring threshold (St). This happens when the maximum score of true negatives (Stn) exceeds the minimum score of true positives (Stp), creating an overlap region in which some unrelated sequences have been matched in error (false-positives), and some truly related sequences have been missed (false-negatives). Figure 12.4.4 illustrates how difficult it is to differentiate correct from false matches when the scores of true-positive and true-negative sequences overlap.
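The overlap described above can be made concrete with a small sketch. The scores below are invented purely for illustration, and the function name `overlap_at_threshold` is hypothetical; the variable names follow the St, Stp and Stn labels used in the text, and a sequence is assumed to count as matched when its score is at or above the threshold.

```python
def overlap_at_threshold(related_scores, unrelated_scores, threshold):
    """Summarise the errors made when every score >= threshold is a match.

    related_scores:   scores of sequences that truly belong to the family
    unrelated_scores: scores of sequences that do not
    """
    s_tp = min(related_scores)     # minimum score of the truly related sequences
    s_tn = max(unrelated_scores)   # maximum score of the unrelated sequences
    false_positives = sum(score >= threshold for score in unrelated_scores)
    false_negatives = sum(score < threshold for score in related_scores)
    return {
        "S_tp": s_tp,
        "S_tn": s_tn,
        "overlap": s_tn > s_tp,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Invented scores: the two distributions overlap between 4 and 6.
related = [9, 8, 7, 5, 4]
unrelated = [1, 2, 3, 5, 6]
summary = overlap_at_threshold(related, unrelated, threshold=5)
print(summary)
```

Moving the threshold up or down in this toy data trades false positives against false negatives; no single value removes both while the distributions overlap.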
[Figure 12.4.4 labels: true negatives; threshold, St; true positives; Stp; false negatives; Stn; false positives; x axis: score.]

Analysis methods aim to reduce or eliminate the overlap between the true-positive and true-negative distributions, hence reducing or eliminating the number of false results and improving diagnostic performance. The best diagnostic performance is achieved when all true family members are matched, with no false matches or missed sequences. Various statistical measures are used to evaluate the performance of analysis methods like this, including:
• sensitivity (or recall), the proportion of the true data-set retrieved
• specificity (or precision), the proportion of all retrieved matches that belong to the true data-set
• accuracy, the proportion of all the results that are true (true-positive and true-negative).

Link
Find out more about statistical tests in Unit 3: Analysis of scientific data and information.

Activity
Examine the distribution of scores illustrated in Figure 12.4.4.
• Can you suggest a score at which a statistically significant result might not be biologically relevant? Give your reasoning.

Determining the statistical significance of results returned by such analysis methods will get harder as the onslaught from next- and third-generation sequencing technologies gains pace. Moreover, determining statistical significance does not guarantee biological significance; rigorous checks should therefore always be applied. This is especially true in the field of sequence analysis and annotation, where errors can easily slip into databases, and then propagate to other resources that use their data. Today, annotating sequences with functional, structural and evolutionary information is still a non-trivial computational task, and therefore requires machine automation to be appropriately balanced with human supervision.
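The three performance measures listed earlier in this section can be written out as simple ratios of true-positive (TP), false-positive (FP), true-negative (TN) and false-negative (FN) counts. The sketch below follows the definitions given in this guide; the function name `performance` and the counts are hypothetical. Note that the guide uses ‘specificity’ in the sense of precision, TP / (TP + FP), whereas many statistics texts define specificity as TN / (TN + FP), so the code uses the name `precision` to avoid ambiguity.

```python
def performance(tp, fp, tn, fn):
    """Compute the measures as defined in the text from the four counts."""
    return {
        # proportion of the true data-set retrieved
        "sensitivity": tp / (tp + fn),
        # proportion of retrieved matches that belong to the true data-set
        "precision": tp / (tp + fp),
        # proportion of all results that are true (TP and TN)
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Invented counts for a search of 100 sequences, 10 of them true family members.
measures = performance(tp=8, fp=2, tn=88, fn=2)
print(measures)
```

In a database search the true negatives usually far outnumber everything else, so accuracy can look high even when sensitivity and precision are modest; this is one reason to report all three.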
Take it further
Find out more about bioinformatics analysis methods and statistics in Bioinformatics and Molecular Evolution (Higgs and Attwood, 2005, Wiley-Blackwell).

Protein sequence analyst

Toni works at the University of Manchester as a protein sequence analyst for the PRINTS protein fingerprint database. Her work involves running several sequence analysis tools. First, she aligns an initial set of amino acid sequences, which are believed to represent a given protein family (for example, the prion proteins: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/ALIGN/PRINTShtmlalign.cgi?align=PRION); she examines the alignment to discover the most conserved regions or motifs. Next, she scans the UniProt database to try to discover additional sequences that share all of the pinpointed motifs. The challenge she then faces is to discern from the results of these scans which of the matched sequences are genuine members of the protein family. At a given scoring threshold, she will find many cases where the matched sequences have high scores but are not related to the family of interest. Her task is therefore to find the ‘best’ score, where the number of wrongly matched sequences and the number of missed true family members are minimised, and the number of correctly matched true family members is maximised.

When confident that she has found the best score to minimise the errors, she can compile her results into a PRINTS template, which she must then annotate with a range of biological and technical information before depositing the entry into the database: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/PRINTS/DoPRINTS.pl?cmd_a=Display&qua_a=/Full&fun_a=Code&qst_a=PRION.

Note that in the PRINTS entry for the prion family, Toni has identified a sequence that matches only three of the eight defining motifs; however, she was confident that this was a true family member (although only a partial match to the fingerprint), because the UniProt annotation indicates that the sequence is a relative of the prion family from the chicken.

Further reading

Attwood, T. and Parry-Smith, D. (1999) Introduction to Bioinformatics, Prentice Hall. Contains more information about sequence analysis methods and the need for critical analysis. Refer to Chapter 6 for further information on pairwise alignments and dotplots, and to Chapter 8 for background on alignment-based pattern-recognition methods.

Higgs, P. and Attwood, T. (2005) Bioinformatics and Molecular Evolution, Wiley-Blackwell. Also contains further information about bioinformatics analysis methods and statistics; refer in particular to Chapters 7 and 9.

Zvelebil, M. and Baum, J. (2007) Understanding Bioinformatics, Garland Science. Chapter 6 offers further insights into pattern-recognition methods.
Checklist

At the end of this topic guide you should be familiar with the following ideas about bioinformatics:
• many different methods are used to identify patterns in biological data
• in the field of sequence analysis, different methods have been developed to identify biological characteristics in individual sequences, in pairs of sequences and in sets of multiple sequences
• alignment of multiple sequences is computationally demanding, and requires the use of approximate algorithms
• pairwise alignment methods have been optimised for searching multiple sequences in databases
• multiple alignment-based methods to identify conserved functional and structural traits in protein sequences typically use single motifs (consensus expressions), multiple motifs (fingerprints) or domains (profiles, HMMs)
• p-values and e-values are used to indicate the significance of matches retrieved from database searches
• specificity (precision), sensitivity (recall) and accuracy are statistical measures used to evaluate the diagnostic performance of analysis methods
• statistically significant matches need not be biologically significant
• results of automatic analyses should be rigorously checked, preferably by humans.

Acknowledgements

The publisher would like to thank the following for their kind permission to reproduce their photographs: PhotoDisc: Lawrence Lawry. All other images © Pearson Education.

Every effort has been made to trace the copyright holders and we apologise in advance for any unintentional omissions. We would be pleased to insert the appropriate acknowledgement in any subsequent edition of this publication.