Unit 12: Bioinformatics
12.4 Analysis methods
We have seen how biological sequences have been annotated and stored
in databases, and how their entries can be linked to allow cross-database
searching. Let us now think about some of the analysis methods used to find
patterns and regularities in data-sets.
This topic guide will look, in particular, at methods that help to derive
database annotations, especially those used to assign structural and functional
information to uncharacterised sequences.
On successful completion of this topic you will:
• understand the processes of computational biology (LO2).
To achieve a Pass in this unit you need to show that you can:
• discuss computational processes for the collection and manipulation
of data (2.1)
• discuss the need to extract only the information relevant to a specific
biological question (2.2).
1 Computational methods for finding patterns
and regularities in data
Increasingly, bioinformaticians have to analyse ‘raw’ (i.e., unannotated) sequences
to try to shed light on their biological roles and evolutionary relationships.
Sequences can be analysed using a range of methods, at various levels of
specificity. The methods often work by identifying ‘tell-tale’ sequence patterns
and regularities.
Let us now consider how the tell-tale traits that protein sequences share can be
analysed at different levels of sophistication, and what we can learn from them.
Key terms
Solvent accessibility: A measure
of the likelihood of components of
a biomolecule, such as a protein,
to be available to interact with the
surrounding solvent.
Algorithm: A series of instructions
set out in a stepwise way to allow
computers to perform calculations.
Hydrophobicity: The propensity of a
biochemical entity to avoid water.
Hydrophilicity: The propensity of
a biochemical entity to display an
affinity for water.
Individual sequences
A simple sort of analysis is to identify patterns that typify localised functional or
structural features (transmembrane (TM) domains, solvent-accessible regions, etc.). One
method for detecting such features is to create a graph using what is known as a
‘sliding-window’ algorithm: this divides the sequence into overlapping chunks, and plots
the average value of a given amino-acid property for each chunk, providing a rapid visual
overview of the entire sequence.
Consider the property of hydrophobicity. We take a window the size of the feature
being investigated (~20 amino acids for a membrane-spanning helix). From the
N terminus, hydrophobicity values of residues that lie in the window are averaged;
the window then moves stepwise through the sequence, one residue at a time,
and the calculation is repeated until the C terminus is reached.
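The sliding-window procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to draw Figure 12.4.1: the guide does not name a hydrophobicity scale, so the Kyte-Doolittle hydropathy values are assumed here, and the function name and the toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed scale; the guide does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence, window=20, scale=KYTE_DOOLITTLE):
    """Return one average score per window position, from N to C terminus."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[residue] for residue in sequence]
    profile = []
    # Move the window one residue at a time until it reaches the C terminus.
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        profile.append(sum(chunk) / window)
    return profile


if __name__ == "__main__":
    # Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
    toy = "KKDEEKRDE" + "LLIVAVLLIAGVLLIVALLA" + "RKEEDKKRDE"
    profile = hydrophobicity_profile(toy, window=20)
    peak = max(range(len(profile)), key=profile.__getitem__)
    print(f"{len(profile)} windows; highest average starts at residue {peak + 1}")
```

Plotting the returned list against window position would give a profile of the kind described in the next paragraph; the choice of window size and of significance thresholds is left to the analyst.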
Figure 12.4.1: Using a hydrophobicity profile to pinpoint likely TM domains.
[Plot: window (property) score (y axis, 0 to 6) against position in the query sequence (x axis, 0 to 400).]
The results are plotted in a 2D graph in which the query sequence lies on the
x axis, and the hydrophobicity score along the y axis. The graph is characterised by
peaks and troughs (see Figure 12.4.1) corresponding to the most hydrophobic
and most hydrophilic parts of the sequence. Significance thresholds are often
used to pinpoint possible TM domains: in Figure 12.4.1, the unbroken and broken
lines mark the upper and lower significance bounds. From the figure, we can infer
that the most hydrophobic peaks (those above the upper threshold) are likely to
correspond to TM domains.
Link
Find out more about the properties of amino acids in Unit 1: Biochemistry
of macromolecules and metabolic pathways.
Activity
In the hydrophobicity profile shown in Figure 12.4.1, how many potential TM domains have
been pinpointed? Give your reasoning.
Pairs of sequences
We can visualise the similarity shared by two sequences using what is known as a
dotplot. This is a table where the sequences lie on the horizontal and vertical axes,
and the table cells contain the scores of all their pairwise residue comparisons.
Consider the first row of the table shown in Figure 12.4.2. If we compare the first
residue of the vertical sequence to every residue in the horizontal sequence, there
is only one position where the residues are the same – the first. An ‘x’ is therefore
marked at this spot. This comparison process is repeated for every residue of the
vertical sequence, giving rise to a table in which each identical residue match is
marked with an x.
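The comparison process just described can be sketched as follows. The two sequences are those on the axes of Figure 12.4.2; the function name and the use of ‘.’ for non-matching cells are illustrative choices, not part of the guide.

```python
def dotplot(horizontal, vertical):
    """Return one text row per vertical residue: 'x' for an identity, '.' otherwise."""
    return [
        "".join("x" if v == h else "." for h in horizontal)
        for v in vertical
    ]


if __name__ == "__main__":
    horizontal = "CTYFPHFELGHGSAQRGHG"   # sequence on the horizontal axis
    vertical = "CTYFPHFSDLGHGSAQKGHG"    # sequence on the vertical axis
    print("  " + horizontal)
    for residue, row in zip(vertical, dotplot(horizontal, vertical)):
        print(residue + " " + row)
```

Every residue of one sequence is compared with every residue of the other, so the table has as many cells as the product of the two sequence lengths.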
Key terms
Unitary matrix: Also known as an
identity matrix, a table that records
a score of 1 for all amino acid (or
nucleotide) self-substitutions and 0
for all non-self-substitutions.
Substitution matrix: A table that
encodes the evolutionary exchange
rates of each of the amino acid
residues for each other in the form of
a set of pairwise probability scores.
Figure 12.4.2: Pairwise alignment of
two short sequences of unequal length
visualised using a dotplot. Using a unitary
matrix, the alignment score is 17.
If identical matches score 1 and non-identities zero, we are using a ‘unitary matrix’
(to quantify similarity more sensitively, we use substitution matrices, which also
give positive scores to exchanges between amino acids with similar properties).
For identical sequences, dotplots show an unbroken diagonal line; for similar
sequences, the main diagonal is broken (the blue line in Figure 12.4.2), and the
interruptions pinpoint residue differences. Diagonals lying off the main diagonal
denote the presence of internal sequence repeats (the grey lines in Figure 12.4.2). The
lengths of the main and off-diagonal lines denote the spans of identical
residues that the two sequences share.
Another way to compare two sequences is to align them, one below the other.
If they differ in length, ‘gap’ characters must be introduced to bring them into
vertical register. Figure 12.4.2 illustrates an alignment of two short sequences,
scored with a unitary matrix and visualised with a dotplot.
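As a small check on the score quoted for Figure 12.4.2, the sketch below scores a gapped alignment with a unitary matrix. It assumes that a gap column scores zero (the guide does not state a gap penalty), writes the gap as ‘-’, and uses a hypothetical function name.

```python
def unitary_score(aligned_a, aligned_b, gap="-"):
    """Score two aligned sequences: 1 per identical column, 0 otherwise."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A column counts only if both residues are identical and neither is a gap.
    return sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)


if __name__ == "__main__":
    top = "CTYFPHF-ELGHGSAQRGHG"     # shorter sequence, one gap introduced
    bottom = "CTYFPHFSDLGHGSAQKGHG"
    print(unitary_score(top, bottom))
```

Swapping the identity test for a lookup in a substitution matrix would give the more sensitive scoring mentioned earlier.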
[Figure 12.4.2, alignment panel]
CTYFPHF-ELGHGSAQRGHG
CTYFPHFSDLGHGSAQKGHG
Score = 17
[Figure 12.4.2, dotplot panel: horizontal axis CTYFPHFELGHGSAQRGHG; vertical axis CTYFPHFSDLGHGSAQKGHG]
Key terms
Dynamic programming: A
computer programming method
that breaks down large, usually
very difficult, problems into smaller
sub-problems that can be solved,
and combines the results to give an
overall solution.
Heuristic algorithm: An algorithm
that approximates the solution to
a complex problem by solving a
simpler problem, producing a good,
but not necessarily the best, solution.
p-value: In a database search, the
probability of there being a match
with a score greater than or equal to
that of a retrieved match, relative to
the scores expected when comparing
random sequences of the same
length and composition as the query
to the database.
e-value: In a database search, the
number of matches with scores
greater than or equal to that of the
retrieved match that are expected to
occur by chance in a database of the
same size and composition, using the
same scoring system.
Motif: An un-gapped, conserved
region of an alignment, typically
10–20 amino acids in length,
often denoting a key structural or
functional feature.
Link
Find out more about the concept of
probability in Unit 10: Statistics for
experimental design.
Activity
Examine the dotplot illustrated in Figure 12.4.2.
• What are the lengths of the two sequences?
• Why is their alignment score 17?
• What is the length and what is the amino acid sequence of the internal repeat?
The best alignment between two sequences can be calculated using ‘dynamic
programming’ methods. As the time taken to do this is proportional to the
product of the sequence lengths, these methods do not scale for large numbers
of sequences, such as those found in databases. Instead, approximate or ‘heuristic’
algorithms are used to find good (but not necessarily the best) alignments in
reasonable time frames. In this way, pairwise algorithms have been optimised for
database searching, and provide statistical estimates (p-values and e-values) to
give measures of confidence in their results: the p-value gives the probability that a
match scoring at least as well as the retrieved match would arise by chance, while the
e-value denotes the number of chance matches expected with equal or better scores than
that of the retrieved match. The closer these values are to 0, the more significant the
matches are.
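To make the cost of dynamic programming concrete, here is a minimal sketch of a global alignment score computed in the style of the Needleman-Wunsch algorithm (a standard dynamic-programming method that the guide does not name). The default scores assume a unitary matrix with no gap penalty; real tools normally use substitution matrices and negative gap penalties, and the function name is hypothetical.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Fill a dynamic-programming table and return the best global alignment score."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    # Each cell reuses the three neighbouring sub-problems already solved.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align the two residues
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
    return table[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("CTYFPHFELGHGSAQRGHG", "CTYFPHFSDLGHGSAQKGHG"))
```

The two nested loops visit every cell of a table whose size is the product of the sequence lengths, which is why this approach becomes impractical when a query must be compared with every entry in a large database.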
Multiple sequences
Aligning multiple sequences is, arguably, more useful than just looking at pairs
of sequences because in this context conserved traits can be seen. Various
alignment-based methods have been developed to help identify conserved
functional and structural traits (binding sites, active sites, domains, etc.), and to
determine their biological significance. The methods vary both in how they use
patterns of similarity to identify such features, and in their diagnostic performance.
As illustrated in Figure 12.4.3, there are three main approaches, depending on
whether the methods use:
• single motifs (for example, encoded as consensus expressions)
• multiple motifs (for example, encoded as fingerprints), or
• complete domains (for example, encoded as profiles).
Figure 12.4.3: Alignment-based pattern-recognition methods. These methods fall into
three main classes, depending on whether they use single motifs, multiple motifs or
full domains.
[Panels: single-motif methods (e.g. PROSITE); multiple-motif methods (e.g. PRINTS); domain-based methods (e.g. Pfam)]
i) Consensus expressions are the simplest, because they distil the residue
conservation information from regions of the full alignment into a single
consensus sequence or ‘pattern’. Here, residues at each position of a single motif
are examined. If a position is conserved (i.e., has the same residue in all sequences
in the alignment), that residue is noted; if several similar residues occur, these are
grouped; if residues share no similarity, an x is given. These observations are used
to create a consensus expression summarising the residue conservation at each
motif position: for example, D-[IVL]-V-[FY]-x-[KRH]-Q. This expression shows:
• three conserved residues (D, V, Q)
• three positions where residues share similar properties ([IVL], [FY], [KRH])
• one non-conserved position (x).
Expressions like this are used to search databases for sequences that share
the same pattern of residue conservation; they work best in diagnosing small
functional sites, and are the basis of the PROSITE database.
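A consensus expression of this kind maps naturally onto a regular expression. The sketch below handles only the three element types used in the example above (a single residue, a bracketed group, and x); the full PROSITE pattern syntax has further elements that are not covered here. The function name and the two test sequences are hypothetical.

```python
import re


def consensus_to_regex(expression):
    """Translate e.g. 'D-[IVL]-V-[FY]-x-[KRH]-Q' into a compiled regular expression."""
    parts = []
    for element in expression.split("-"):
        # 'x' means any residue; single letters and [groups] are already valid regex.
        parts.append("." if element == "x" else element)
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = consensus_to_regex("D-[IVL]-V-[FY]-x-[KRH]-Q")
    for candidate in ("DLVFSHQ", "DAVFSHQ"):
        verdict = "matches" if pattern.fullmatch(candidate) else "does not match"
        print(candidate, verdict)
```

Using `pattern.search` instead of `pattern.fullmatch` would find the motif anywhere inside a longer sequence, which is closer to how such expressions are used in database searches.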
Activity
Examine the consensus expression described above.
• Consider the sequences DVVYAKQ and DIVFGRN. Do the sequences match the expression?
Explain your reasoning.
ii) Fingerprints combine groups of motifs in sets of aligned sequences to create
more powerful diagnostic signatures – the more motifs they include, the more
specific fingerprints become. Residue information within motifs is scored with a
simple unitary matrix. Fingerprints are used to search databases for sequences
that contain the same sets of conserved motifs; they work best at diagnosing
members of closely related families and subfamilies, and are the basis of the
PRINTS database.
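The idea of matching several motifs at once can be illustrated with a toy example. This is not the PRINTS scanning algorithm: it assumes each motif is represented by a single sequence, scores windows by simple identity (a unitary matrix), ignores the order and spacing of motifs, and uses a hypothetical 80% identity cut-off. All names and sequences are invented for illustration.

```python
def best_window_identity(sequence, motif):
    """Highest count of identical residues between the motif and any equal-length window."""
    width = len(motif)
    best = 0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        best = max(best, sum(1 for a, b in zip(window, motif) if a == b))
    return best


def motifs_matched(sequence, fingerprint, min_identity=0.8):
    """Count the motifs whose best window reaches the identity cut-off."""
    return sum(
        1 for motif in fingerprint
        if best_window_identity(sequence, motif) >= min_identity * len(motif)
    )


if __name__ == "__main__":
    fingerprint = ["GDSLVNK", "WTACHPE", "MKRRYLF"]   # hypothetical motifs
    full = "AAGDSLVNKTTTWTACHPEQQMKRRYLFSS"           # contains all three
    partial = "AAGDSLVNKTTTTTTTTTTT"                  # contains only the first
    for name, seq in (("full", full), ("partial", partial)):
        print(name, motifs_matched(seq, fingerprint), "of", len(fingerprint))
```

A sequence that matches only some of the motifs is a partial match, and deciding whether it is a true family member still calls for the kind of human judgement discussed later in this guide.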
iii) Profiles are more complex, because they encapsulate complete domains
(i.e., including regions with insertions and deletions) and use substitution
matrices to score amino acid comparisons. Each alignment position is scored
according to whether it is conserved, or whether a residue has been inserted
or deleted (large insertions and deletions are penalised with negative scores).
Profiles work best at diagnosing divergent members of superfamilies; they
were introduced into PROSITE to offer diagnostic alternatives for some of its
weaker consensus expressions.
Hidden Markov models (HMMs) are like profiles but use probabilities to denote
whether residue positions are conserved, or whether the positions include
insertions or deletions. Like profiles, HMMs are best at diagnosing members of
superfamilies; they are the basis of the Pfam database.
Take it further
Find out more about sequence analysis methods and the need for critical analysis in Introduction
to Bioinformatics (Attwood and Parry-Smith, 1999, Prentice Hall) and in Bioinformatics and
Molecular Evolution (Higgs and Attwood, 2005, Wiley-Blackwell).
2 Methods for determining accuracy and
statistical significance
The methods described in the previous section aim to identify functionally or
structurally related sequences. The challenge is to determine:
i which of the sequences matched are truly related (true-positive)
ii which sequences are unrelated (true-negative).
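Figure 12.4.4: Resolving true from false matches.
[Plot: number of matches (y axis) against score (x axis), showing the distributions of true negatives and true positives, the threshold St, the scores Stp and Stn, and the regions of false negatives and false positives.]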
The results of database searches often contain a mixture of true and false matches,
at a given scoring threshold (St). This happens when the maximum score of true
negatives (Stn) exceeds the minimum score of true positives (Stp), creating an
overlap region in which some unrelated sequences have been matched in error
(false-positives), and some truly related sequences have been missed
(false-negatives). Figure 12.4.4 illustrates how difficult it is to differentiate correct from
false matches when the scores of true-positive and true-negative sequences overlap.
Analysis methods aim to reduce or eliminate the overlap between the true-positive
and true-negative distributions, hence reducing or eliminating the
number of false results and improving diagnostic performance. The best
diagnostic performance is achieved when all true family members are matched,
with no false matches or missed sequences. Various statistical measures are used
to evaluate the performance of analysis methods like this, including:
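• sensitivity (or recall), the proportion of the true data-set retrieved
• specificity (or precision), the proportion of all retrieved matches that belong to
the true data-set; note that in wider statistical usage ‘specificity’ usually means the
proportion of true negatives that are correctly rejected, which is a different quantity
• accuracy, the proportion of all the results that are true (true-positive and
true-negative).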
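The three measures can be computed directly from the counts of true and false results at a chosen threshold. The sketch below follows the definitions given in the list above; the function names and the example scores are hypothetical, and it assumes that at least one sequence is retrieved and at least one true family member exists, so that no division by zero occurs.

```python
def confusion_counts(related_scores, unrelated_scores, threshold):
    """Count true/false positives and negatives at a given scoring threshold."""
    tp = sum(score >= threshold for score in related_scores)
    fn = len(related_scores) - tp
    fp = sum(score >= threshold for score in unrelated_scores)
    tn = len(unrelated_scores) - fp
    return tp, fp, tn, fn


def performance(tp, fp, tn, fn):
    """Return sensitivity, precision and accuracy as defined in this guide."""
    return {
        "sensitivity": tp / (tp + fn),             # share of the true data-set retrieved
        "precision": tp / (tp + fp),               # share of retrieved matches that are true
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


if __name__ == "__main__":
    related = [90, 80, 70, 40]        # hypothetical scores of true family members
    unrelated = [60, 30, 20, 10, 5]   # hypothetical scores of unrelated sequences
    for threshold in (35, 50, 65):
        counts = confusion_counts(related, unrelated, threshold)
        print(threshold, counts, performance(*counts))
```

Repeating the calculation over a range of thresholds shows the trade-off: lowering the threshold recovers more true members but admits more false matches, and raising it does the reverse.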
Link
Find out more about statistical tests in Unit 3: Analysis of scientific data
and information.
Activity
Examine the distribution of scores illustrated in Figure 12.4.4.
• Can you suggest a score at which a statistically significant result might not be biologically
relevant? Give your reasoning.
Determining the statistical significance of results returned by such analysis
methods will get harder as the onslaught from next- and third-generation
sequencing technologies gains pace. Moreover, determining statistical
significance does not guarantee biological significance; rigorous checks should
therefore always be applied. This is especially true in the field of sequence analysis
and annotation, where errors can easily slip into databases, and then propagate to
other resources that use their data.
Today, annotating sequences with functional, structural and evolutionary
information is still a non-trivial computational task, and therefore requires
machine automation to be appropriately balanced with human supervision.
Take it further
Find out more about bioinformatics analysis methods and statistics in Bioinformatics and
Molecular Evolution (Higgs and Attwood, 2005, Wiley-Blackwell).
Protein sequence analyst
Toni works at the University of Manchester as a protein sequence analyst for the PRINTS protein
fingerprint database. Her work involves running several sequence analysis tools. First, she aligns
an initial set of amino acid sequences, which are believed to represent a given protein family (for
example, the prion proteins: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/ALIGN/PRINTShtmlalign.cgi?align=PRION);
she examines the alignment to discover the most conserved
regions or motifs. Next, she scans the UniProt database to try to discover additional sequences that
share all of the pinpointed motifs. The challenge she then faces is to discern from the results of these
scans which of the matched sequences are genuine members of the protein family.
At a given scoring threshold, she will find many cases where the matched sequences have high
scores but are not related to the family of interest. Her task is therefore to find the ‘best’ score, where
the number of wrongly matched sequences and the number of missed true family members are
minimised, and the number of correctly matched true family members is maximised. When confident
that she has found the best score to minimise the errors, she can compile her results into a PRINTS
template, which she must then annotate with a range of biological and technical information before
depositing the entry into the database:
http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/PRINTS/DoPRINTS.pl?cmd_a=Display&qua_a=/Full&fun_a=Code&qst_a=PRION.
Note that in the PRINTS entry for the prion family, Toni has identified a sequence that matches only
three of the defining eight motifs; however, she was confident that this was a true family member
(although only a partial match to the fingerprint), because the UniProt annotation indicates that
the sequence is a relative of the prion family from the chicken.
Further reading
Attwood, T. and Parry-Smith, D. (1999) Introduction to Bioinformatics, Prentice Hall. Contains
more information about sequence analysis methods and the need for critical analysis. Refer to
Chapter 6 for further information on pairwise alignments and dotplots, and to Chapter 8 for
background on alignment-based pattern-recognition methods.
Higgs, P. and Attwood, T. (2005) Bioinformatics and Molecular Evolution, Wiley-Blackwell.
Also contains further information about bioinformatics analysis methods and statistics – refer in
particular to Chapters 7 and 9.
Zvelebil, M. and Baum, J. (2007) Understanding Bioinformatics, Garland Science. Chapter 6 offers
further insights into pattern-recognition methods.
Checklist
At the end of this topic guide you should be familiar with the following ideas about bioinformatics:
• many different methods are used to identify patterns in biological data
• in the field of sequence analysis, different methods have been developed to identify
biological characteristics in individual sequences, in pairs of sequences and in sets of
multiple sequences
• alignment of multiple sequences is computationally demanding, and requires the use of
approximate algorithms
• pairwise alignment methods have been optimised for searching multiple sequences
in databases
• multiple alignment-based methods to identify conserved functional and structural traits
in protein sequences typically use single motifs (consensus expressions), multiple motifs
(fingerprints) or domains (profiles, HMMs)
• p-values and e-values are used to indicate the significance of matches retrieved from
database searches
• specificity (precision), sensitivity (recall) and accuracy are statistical measures used to
evaluate the diagnostic performance of analysis methods
• statistically significant matches need not be biologically significant
• results of automatic analyses should be rigorously checked, preferably by humans.
Acknowledgements
The publisher would like to thank the following for their kind permission to reproduce their
photographs:
PhotoDisc: Lawrence Lawry
All other images © Pearson Education
Every effort has been made to trace the copyright holders and we apologise in advance for any
unintentional omissions. We would be pleased to insert the appropriate acknowledgement in any
subsequent edition of this publication.