Sequence Alignment Techniques

In this presentation
Part 1 – Searching for Sequence Similarity
Part 2 – Multiple Sequence Alignment

Part 1 – Searching for Sequence Similarity

Sequence similarity searches
• Sequence similarity searches of databases enable us to extract sequences that are similar to a query sequence
• Information about these extracted sequences can be used to predict the structure or function of the query sequence
• Prediction using similarity is a powerful and ubiquitous idea in bioinformatics. The underlying reason for this is molecular evolution

Sequence alignment
• Any pair of DNA sequences will show some degree of similarity
• Sequence alignment is the first step in quantifying this similarity in order to distinguish between chance similarity and real biological relationships
• Alignments show the differences between sequences as changes (mutations), insertions or deletions (indels or gaps), and can be interpreted in evolutionary terms

Alignment algorithms
• Dynamic programming algorithms can calculate the best alignment of two sequences
• Well-known variants are
  – the Smith-Waterman algorithm (local alignments)
  – the Needleman-Wunsch algorithm (global alignments)
• Local alignments are useful when sequences are not related over their full lengths, e.g., proteins sharing only certain domains or DNA sequences related only in exons

Alignment scores and gap penalties
• A simple alignment score measures the number or proportion of identically matching residues
• Gap penalties are subtracted from such scores to ensure that alignment algorithms produce biologically sensible alignments without many gaps
• Gap penalties may be constant (independent of the length of the gap), proportional (proportional to the length of the gap) or affine (containing gap opening and gap extension contributions)
• Gap penalties can be varied according to the desired application

Similarity and homology
• Similarity may exist between any sequences
• Sequences are homologous only if they
have evolved from a common ancestor
• Homologous sequences separated by speciation (orthologs) often have similar biological functions, but gene duplication allows homologous sequences (paralogs) to evolve different functions

Similarity search in databases
• Sequences similar to a query can be found in a database by aligning the query to each database sequence in turn and returning the highest scoring (most similar) sequences
• This can be achieved by dynamic programming algorithms, but in practice faster approximate methods are often used

Statistical scores
• The p value of a similarity score is the probability of obtaining a score at least as high by chance when comparing two unrelated sequences of similar composition
• Low p values indicate significant matches that are likely to reflect real biological relationships
• The related E value is the expected frequency of chance occurrences scoring at least as high as the identified similarity
• A low p value for a similarity between two sequences can translate into a high E value for a search of a large database, because the E value grows with the number of database sequences compared

Sensitivity and specificity
• These measures quantify the success of a database search strategy
• Sensitivity measures the proportion of real biological sequence relationships in the database that were detected as hits in the search
• Specificity is the proportion of the hits corresponding to real biological relationships
• Changing E and p value thresholds results in a trade-off between these complementary measures of success

Maximizing amino acid identities
• Protein sequences can be aligned to maximize amino acid identities, but this will not reveal distant evolutionary relationships

Evolution
• Protein-coding sequences evolve slowly compared with most other parts of the genome, because of the need to maintain protein structure and function
• An exception to this is the fast evolution that might occur in the redundant copy of a recently duplicated gene

Allowed changes
• Changes in protein sequences during evolution
tend to involve substitutions between amino acids with similar properties because these tend to maintain the structural stability of the protein

Substitution score matrices
• These matrices give scores for all possible amino acid substitutions during evolution
• Higher scores indicate more likely substitutions
• Example matrices are BLOSUM62 and PAM250
• PAM stands for Point Accepted Mutation (an accepted point mutation), and in the case of PAM250, the evolutionary distance of the matrix is 250 amino acid changes per 100 residues
• Dynamic programming algorithms for sequence alignment can operate using scores from these matrices

Significance of score matrices
• Substitution score matrices allow detection of distant evolutionary relationships between protein sequences
• It is possible to detect much more distant relationships by comparing protein sequences than by comparing nucleic acid sequences

MATLEKLMKA FESLKSFQQQ QQQQQQQQQQ QQQQQQQQQQ PPPPPPPPPP PQLPQPPPQA
QPLLPQPQPP PPPPPPPPGP AVAEEPLHRP KKELSATKKD RVNHCLTICE NIVAQSVRNS
PEFQKLLGIA MELFLLCSDD AESDVRMVAD ECLNKVIKAL MSDNLPRLQL ELYKEIKKNG
APRSLRAALW RFAELAHLVR PQKCRPYLVN LLPCLTRTSK RPEESVQETL AAAVPKIMAS

Part of the sequence of human Huntington’s disease protein (Huntingtin) showing low complexity regions (underlined) associated with compositional bias towards glutamine (Q) and proline (P)

[Figure: dot plot, PLEK_HUMAN (horizontal) vs. PLEK_HUMAN (vertical)]
A dot plot of human pleckstrin sequence against itself produced with Erik Sonnhammer’s ‘dotter’ program. The sequence is plotted from N- to C-terminus along horizontal and vertical axes between residues 1 and approximately 350.

[Figure: PAM250 substitution score matrix, shown as a triangular table with rows and columns labelled C S T P A G N D E Q H R K M I L V F Y W]

Sequence 1: MIIVKP –VVLKGDFG
Sequence 2: MILLKP AIIIRAEY
Position score: 656256 044231370

The PAM250 matrix and alignment of sequences. Total alignment scores for two matrices should not be compared, but note that the PAM matrix is able to detect a much better alignment in the second halves of these sequences than the identity matrix. With the introduction of a single gap, sensible alignments of hydrophobic amino acids, and alignment of K with R (both basic), D with E (both acidic) and F with Y (both aromatic) can be seen

Figure 3. Display of the DNA unit. DNA can be described at several levels of detail. At the most detailed level, DNA can be characterized by the 5' and 3' termini at both external and internal positions; at the most abstract level, the substrate DNA can be one of 16 common structures. The goal is to provide methods for specifying the properties of DNA in as many ways as is natural for a scientist.

Figure 7. An initial experimental environment. The temperature is 37 degrees Celsius and the pH value is 7.4. No DNA polymerase I activity is possible

Part 2 – Multiple Sequence Alignment

Non-specific sequence similarity
• Certain types of sequence similarity are less likely than others to indicate an evolutionary relationship
• Examples are similarity between regions of low compositional complexity, short period repeats and protein sequences coding for generic structures like coiled coils

Similarity search filters
• Regions of these non-specific sequence types can degrade the results of similarity searches and are often filtered out of query sequences prior to searching
• The programs SEG and DUST can be used to detect and filter low complexity sequences, XNU can filter short period repeats and COILS can detect the presence of potential coiled coil structures

Database types for searches
• Database and query sequences can be protein or nucleic acid sequences, and different query strategies are required for different types and combinations
• In general, searches are more sensitive using strategies where protein-coding nucleic acid database and/or query sequences are first translated to protein sequences

Iterative database searches
• PSI-BLAST is an iterative search method that improves on the detection rate of BLAST and FASTA
• Each iteration discovers intermediate sequences that are used in a sequence profile to discover more distant relatives of the query sequence in subsequent iterations
• Potential problems with PSI-BLAST are the risk that unrelated sequences pollute the iterative search, and difficulties associated with the domain structure of proteins
• PSI-BLAST can detect up to twice as many evolutionary relationships as BLAST

Multiple sequence alignment
• Multiple alignment illustrates relationships between two or more sequences
• When the sequences involved are diverse, the conserved residues are often key residues associated with maintenance of structural stability or biological function
• Multiple alignments can reveal
many clues about protein structure and function

Multiple alignment
Part of an (artificial) multiple alignment of a family consisting of 7 sequences, which subdivide into 3 subfamilies. The bars on the left indicate subfamilies; the dotted boxes highlight conservation patterns.

Progressive sequence alignment
• Most commonly used software uses the method of progressive alignment
• This is a fast method, but frozen-in errors mean that it does not always work perfectly
• Biological knowledge can provide information about likely alignments, and where automatically produced alignments turn out to be imperfect, software for manual alignment editing is required

Protein families
• Assigning sequences to protein families is a very valuable way of predicting protein function
• Many ways have been developed to represent protein family information (consensus sequences, conserved residues, residue patterns, sequence profiles, etc.) and these have been stored in secondary protein family databases

Consensus sequences
• These condense the information from a multiple alignment into a single sequence
• Their main shortcoming is the inability to represent any probabilistic information apart from the most common residue at a particular position
• Derivation of a consensus sequence illustrates that any protein family representation is subject to bias if the set of sequences from which it was derived is biased

PRINTS and BLOCKS
• These represent protein families as multiply aligned ungapped segments (motifs) derived from the most highly conserved regions of sequences
• By representing more of the sequence, they have the potential to be more sensitive than short PROSITE patterns
• The ability to match only a subset of the motifs associated with a particular family means that they can detect splice variants and sequence fragments and represent subfamilies
• WWW-based search engines for the databases are available

Protein domain families
• Many proteins are built up from
domains in a modular architecture
• The study of protein families is best pursued as a study of protein domain families
• ProDom is a database of protein domain sequences created by automatic means from the protein sequence databases

Resources for domain families
• Pfam and SMART can be used for protein domain family analysis
• The integrated resource InterPro unites PROSITE, PRINTS, Pfam, ProDom and SMART

Visualization of similarities
• Dot plots are a very good way to visualize sequence similarity and find repeats
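
The dot plot idea can be illustrated with a minimal Python sketch. It is not how the dotter program works internally. It assumes the simplest possible scheme: a dot is placed at (i, j) when a short window starting at position i of one sequence and position j of the other contains enough identical residues. The function names `dot_plot` and `render`, the parameters `window` and `min_matches`, and the toy sequence are all hypothetical choices made for this example.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return the set of (i, j) window start positions that count as dots.

    A dot is recorded when the windows seq_a[i:i+window] and
    seq_b[j:j+window] share at least min_matches identical residues.
    Identity scoring is an assumption of this sketch; real tools
    typically use substitution matrix scores instead.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(dots, len_a, len_b, window):
    """Draw the dots as text: seq_a runs horizontally, seq_b vertically."""
    rows = []
    for j in range(len_b - window + 1):
        row = "".join(
            "*" if (i, j) in dots else "."
            for i in range(len_a - window + 1)
        )
        rows.append(row)
    return "\n".join(rows)


if __name__ == "__main__":
    # Hypothetical toy sequence containing the repeat MKALV twice.
    seq = "MKALVMKALV"
    dots = dot_plot(seq, seq, window=3, min_matches=3)
    print(render(dots, len(seq), len(seq), 3))
```

When a sequence is plotted against itself, the main diagonal is always filled, and internal repeats show up as shorter diagonals parallel to it. Lowering `min_matches` below `window` tolerates mismatches at the cost of more background noise.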