Pairwise alignment

Sequence similarity

Sequence Comparison
Much of bioinformatics involves sequences:
• DNA sequences
• RNA sequences
• Protein sequences
We can think of these sequences as strings of letters:
• DNA & RNA: alphabet of 4 letters
• Protein: alphabet of 20 letters

Sequence Comparison - Motivation
• Nucleotide
  – Learn about evolutionary relationships
  – Find genes, domains, signals …
• Protein
  – Learn about evolutionary relationships
  – Classify protein families (function, structure)
  – Identify common domains (function, structure)

Calculation of an alignment score
How do we align two sequences?

    ATTGCAGTGATCG
    ATTGCGTCGATCG

Solution 1 (10 matches, 3 mismatches):

    ATTGCAGTGATCG
    |||||   |||||
    ATTGCGTCGATCG

Solution 2 (12 matches, 2 gaps):

    ATTGCAGT-GATCG
    ||||| || |||||
    ATTGC-GTCGATCG

Which alignment is better? We will use a scoring scheme. Two possible schemes:

                   Scheme 1   Scheme 2
    Match             +1         +1
    Mismatch          -1          0
    Indel (gap)       -2         -2

Resulting scores:

                   Scheme 1               Scheme 2
    Solution 1     10×1 + 3×(-1) = 7      10×1 + 3×0 = 10
    Solution 2     12×1 + 2×(-2) = 8      12×1 + 2×(-2) = 8

The better alignment depends on the scheme: Solution 2 scores higher under scheme 1, Solution 1 under scheme 2.

Scoring Alignments - intuition
• Similar sequences evolved from a common ancestor.
• Evolution changed the sequences from this ancestral sequence by mutations:
  – Replacement: one letter replaced by another
  – Deletion: deletion of a letter
  – Insertion: insertion of a letter
• Scoring of sequence similarity should examine how many operations took place.

Causes for sequence (dis)similarity
• mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g. ATA → AGA)
• insertion: at a certain location one new nucleotide is inserted between two existing nucleotides (e.g. AA → AGA)
• deletion: at a certain location one existing nucleotide is deleted (e.g. ACTG → AC-G)
• indel: an insertion or a deletion

Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.

Gap Opening
• The gap-opening penalty defines the cost for opening a gap in one of the sequences.
• If you raise the gap-opening penalty above the default, local alignments that contain gaps may be split into several shorter alignments.

Affine Gap Penalties
• In nature, a series of indels often comes as a single event rather than as a series of single-nucleotide events.
[Figure: two alignments, one labelled "This is more likely", the other "This is less likely".]
• Normal scoring would give the same score for both alignments.
• An affine penalty charges separately for opening a gap and for extending it:
    Gap = Gapopen + Len * Gapextend

Gap penalties lead to:
• Increasing the penalties for gap opening and extension
  – The alignment will contain fewer gaps and more mismatches.
• Decreasing the penalties for gap opening and extension
  – The alignment will contain more gaps (of varied lengths) and fewer mismatches.
• Keeping the gap-opening penalty constant and increasing the gap-extension penalty
  – Very long gaps will not be tolerated; they will be replaced with additional gaps of medium length and with mismatches.

Sequence similarity

Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. As these sequences are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique.
[Figure: schematic of a global alignment versus local alignments.]

Local alignment
Local alignment methods find related regions within sequences; they can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins can be identified as being related.
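Returning to the gap formula above (Gap = Gapopen + Len * Gapextend), the following is a minimal Python sketch of how an already-aligned pair could be scored. The function name `score_alignment` and its parameter names are illustrative assumptions, not part of the slides or of any alignment program. With `gap_open=0` it reduces to the simple per-position indel score, so it reproduces the scores 7 and 8 computed for Solution 1 and Solution 2.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=0, gap_extend=-2):
    """Score two aligned strings ('-' marks a gap) column by column.

    A run of gaps of length Len in one sequence contributes
    gap_open + Len * gap_extend, following the affine formula.
    All names and defaults here are illustrative assumptions.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    prev_gap_row = None  # which sequence held the gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair two gaps")
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            if gap_row != prev_gap_row:
                score += gap_open      # a new gap starts here
            score += gap_extend        # every gap position is charged
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


# The two solutions from the slides, scored with match +1, mismatch -1, indel -2.
print(score_alignment("ATTGCAGTGATCG", "ATTGCGTCGATCG"))     # 7
print(score_alignment("ATTGCAGT-GATCG", "ATTGC-GTCGATCG"))   # 8

# Hypothetical example: one gap of length 3 versus three gaps of length 1.
one_gap = ("ACGTTTACGT", "ACG---ACGT")
scattered = ("ACGTACGTAC", "A-GT-CGT-C")
for pair in (one_gap, scattered):
    linear = score_alignment(*pair, gap_open=0, gap_extend=-2)
    affine = score_alignment(*pair, gap_open=-3, gap_extend=-1)
    print(linear, affine)
```

In the hypothetical example both alignments have 7 matches and 3 gap positions, so the linear scheme cannot tell them apart, while the affine scheme (opening -3, extension -1, values chosen only for illustration) favours the single contiguous gap.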
[Figure: schematic of a global alignment versus local alignments.]

Global vs. Local
• Use global alignment if
  – You expect, based on some biological information, that your sequences will match over their entire length.
  – Your sequences are of similar length.
• Use local alignment if
  – You expect that only certain parts of the two sequences will match (as in the case of a conserved segment that can be found in many different proteins).
  – Your sequences are very different in length.
  – You want to search a sequence database (we will talk about this in detail later).

If two proteins share more than one common region, for example one has a single copy of a particular domain while the other has two copies, it may be possible to "miss" one of the two copies when using a local alignment program that presents only the best-scoring alignment.
EMBOSS [best solution] vs. LALIGN (EMBnet) [several solutions]

Comparing nucleotides
• Every match gets the same score.
• Every mismatch gets the same score.
• Gaps: we chose the penalties ourselves, but the defaults are usually good.
However, in the case of amino acids (aa):
• Not all matches are the same.
• Different mismatches get different scores.

Amino acid properties
• Serine (S) and Threonine (T) have similar physicochemical properties.
• Aspartic acid (D) and Glutamic acid (E) have similar properties.
=> Substitution of S/T or E/D occurs relatively often during evolution.
=> Substitution of S/T or E/D should result in scores that are only moderately lower than identities.
Each aa is characterized by a combination of features (size, charge, etc.). The relative importance of each feature may vary according to the role of the aa in the 3-D structure and function of the protein.
So how can we score matches and mismatches?

Amino Acid Substitution Matrices
• The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other.
• These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families.
• These scoring systems have a probabilistic foundation.

PAM series - Percent Accepted Mutation (accepted by natural selection)
• All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (a total of 1572 observed changes). Some of the protein families are:
  – Ig kappa chain
  – Kappa casein
  – Lactalbumin
  – Hemoglobin α
  – Myoglobin
  – Insulin
  – Histone H4
  – Ubiquitin
• PAM matrices are based on global sequence alignments; these include both highly conserved and highly mutable regions.

Varying degrees of conservation
• PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence.
• At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids.
• Other PAM matrices are extrapolated from PAM1.
• For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids (a position may change more than once).

The BLOSUM Family of Matrices
Blocks Substitution Matrices (Henikoff and Henikoff, 1992)
• Blocks are short conserved patterns, 3-60 aa long.
• Proteins can be divided into families by common blocks.
[Figure: blocks A, B, C, D.]
• Different BLOSUM matrices emerge by looking at sequences with different identity percentages.
• Example: BLOSUM62 is derived from blocks in which sequences sharing at least 62% identity are clustered and counted as a single sequence.
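The statement that other PAM matrices are extrapolated from PAM1 can be illustrated with a small sketch: the mutation probability matrix for N PAM units is obtained by multiplying the PAM1 probability matrix by itself N times. The 3-letter matrix `TOY_PAM1` below is a made-up toy example (each letter is conserved with about 99% probability), not the real 20×20 Dayhoff data, and the helper names `mat_mul` and `mat_power` are assumptions introduced for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return the n-th power of the square matrix m (n >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 3-letter "PAM1": entry [i][j] is the probability that letter i
# is replaced by letter j after 1 PAM unit. Each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal of the toy matrix is much smaller than in `TOY_PAM1` but not zero, because repeated changes at the same position can revert or overwrite each other. Real PAM scoring matrices are then built as log-odds scores from such probabilities, a step this sketch leaves out.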
The Blocks Database
• Gapless alignment blocks

BLOSUM62 scoring matrix

            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
    A       4
    R      -1   5
    N      -2   0   6
    D      -2  -2   1   6
    C       0  -3  -3  -3   9
    Q      -1   1   0   0  -3   5
    E      -1   0   0   2  -4   2   5
    G       0  -2   0  -1  -3  -2  -2   6
    H      -2   0   1  -1  -3   0   0  -2   8
    I      -1  -3  -3  -3  -1  -3  -3  -4  -3   4
    L      -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
    K      -1   2   0  -1  -1   1   1  -2  -1  -3  -2   5
    M      -1  -2  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
    F      -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
    P      -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
    S       1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
    T       0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
    W      -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
    Y      -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
    V       0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
            A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V

PAM versus BLOSUM
PAM:
• Based on an explicit evolutionary model
• Derived from small, closely related proteins with ~15% divergence
• Higher PAM numbers detect more remote sequence similarities
• Errors in PAM1 are scaled 250× in PAM250
BLOSUM:
• Based on empirical frequencies
• Uses a much larger, more diverse set of protein sequences (30-90% identity)
• Lower BLOSUM numbers detect more remote sequence similarities
• Errors in BLOSUM arise from errors in alignment

Guidelines
• Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
• Higher PAMs and lower BLOSUMs find longer, weaker local alignments.
• No single matrix answers all questions.

Guidelines (continued)
• BLOSUM is generally better than PAM for local alignments.
• The default matrix is often the identity matrix for DNA and BLOSUM62 for proteins.
• When using BLOSUM80 instead of BLOSUM45, local alignments tend to be shorter.
• Low PAMs have the same effect as high BLOSUMs.
• The BLOSUM number indicates percent identity, while the PAM number is proportional to the percent of accepted mutations.
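To see how a substitution matrix replaces the single match/mismatch score used for nucleotides, here is a minimal Python sketch that scores a gapless protein alignment with a handful of BLOSUM62 entries copied from the matrix above (residues S, T, D, E and W only). The names `BLOSUM62_SUBSET`, `pair_score` and `score_ungapped`, and the short example peptides, are assumptions made for illustration.

```python
# A small subset of BLOSUM62 entries for residues S, T, D, E, W.
# Only one orientation of each pair is stored; the matrix is symmetric.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("T", "T"): 5, ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("S", "T"): 1, ("D", "E"): 2,
    ("D", "S"): 0, ("D", "T"): -1, ("E", "S"): 0, ("E", "T"): -1,
    ("D", "W"): -4, ("E", "W"): -3, ("S", "W"): -3, ("T", "W"): -2,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score for residues a and b in either orientation."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError for residues outside the subset


def score_ungapped(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores of a gapless alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(score_ungapped("STDEW", "STDEW"))  # identities only: 4+5+6+5+11 = 31
print(score_ungapped("STDEW", "TSEDW"))  # conservative S/T and D/E swaps: 17
print(score_ungapped("STDEW", "WWWWW"))  # mostly dissimilar residues: -1
```

The conservative S/T and D/E substitutions still earn positive scores, while pairing these residues with tryptophan is penalised, and not all identities are worth the same (W/W scores 11, S/S only 4).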
