Download Sequence identity and homology

Protein homology I: Evolution and comparison of protein sequences Biochem 565, Fall 2007 09/10/07 Cordes Outline 1. What is homology? 2. Sequence conservation and covariation 3. Generalized substitution matrices 4. Pairwise alignment--global vs. local 5. Sequence identity and homology 6. Sequence similarity and homology-use of substitution matrices 7. Alignment scores and statistics 8. Limitations of pairwise alignment 9. Remote homologies--use of evolutionary profiles Homology between different proteins--key terms boxes represent protein-coding genes A1 gene duplication A1 A2 speciation orthologs paralogs A1 homologous paralogous orthologous A2 A1 A2 descended from a common ancestor, e.g. “A1 and A2 are homologous”. Also sometimes defined as “Similar due to descent from a common ancestor.” Homology is either/or-there is no such thing as “percent homology”! Homologous is not a synonym for “similar”! It is, however, possible for only a part of two sequences to be homologous, for instance one domain in multidomain prot. related by gene duplication related by speciation Sequence conservation among homologous proteins If one aligns the sequences of multiple proteins which are related by common ancestry and thus form a family, one can then examine the extent of amino-acid sequence evolution both in general and at particular positions in the alignment. bars here reflect level some positions universally conserved of conservation some are highly variable coloring reflects conserved, chemically similar amino acid types, e.g. charged residues such as Arg, Lys Multiple alignment of six bacteriophage Cro proteins (made with ClustalX). Overall rates of protein sequence evolution vary widely Table. Substitution rates in genes encoding orthologous rodent and human proteins. Units are substitution rates per site per billion years. protein nonsynonymous rate synonymous rate KA/KS histone 3 actin a insulin myoglobin b-globin urokinase 0.00 0.01 0.13 0.56 0.80 1.28 6.38 3.68 4.02 4.44 3.05 3.92 0 0.002 0.03 0.126 0.262 0.362 Nonsynonymous changes are nucleotide substitutions which lead to amino acid changes. Synonymous changes are those which do not. KA/KS is the ratio of nonsynonymous to synonymous changes in the gene, and is a measure of the functional selection on a protein. Proteins with highly constrained function tend to evolve more slowly and have lower KA/KS values. This includes critical proteins with multiple levels of function and regulation, such as histones. adapted from Protein Evolution by L. Patthy, Blackwell Science, 1999 and from Fundamentals of Molecular Evolution by Li & Graur, Sinauer, 1991 Orthologous vs. paralogous proteins As a general rule, orthologous proteins tend to perform the same function in different species, while paralogous proteins tend to have diversified somewhat in function. For example, the chymotrypsin serine proteases are orthologous to each other, and they retain not only the same general function (proteolysis using a catalytic triad including a serine), but also the same substrate specificity (cleavage at positions following aromatic side chains). The chymotrypsins are paralogous to the trypsins and the elastases. These proteins share the same general serine protease function but have evolved different substrate specificities. These proteins also have paralogs which have lost all protease activity. Since the functional constraints are different, sequence conservation patterns might vary depending upon whether one is comparing orthologs or paralogs, or close paralogs vs. distant paralogs. Classic phylogenetic studies of sequence conservation: the globins The globins are the best studied family in terms of sequence conservation, partly because they were one of the first families for which multiple members were sequenced, and partly because some of the earliest protein structures (in fact, the earliest) solved were globins. The classic papers of Perutz, Kendrew and Watson were the first to correlate sequence conservation with aspects of protein structure and function. They drew their conclusion based on only a few aligned sequences. Later globin studies, such as that of Bashford, Chothia and Lesk, expanded the analyses of globin sequence conservation to include hundreds of sequences. Perutz, Kendrew & Watson J Mol Biol 13, 669 (1965) Bashford, Chothia & Lesk J Mol Biol 196, 199 (1987) Scapharca inaequivalvis oxygenated hemoglobin Conservation of functional residues There were only 2 perfectly conserved residues among the 8 known globin structures at the time of the Bashford et al study. These are residues critical in binding of heme and/or interaction w/heme-bound oxygen. It will often be found that the best conserved residues in related Phe 43 proteins are those involved in heme critical aspects of the general function. His 87 Residues involved in more specific aspects of function may or may not be conserved, depending upon the relationship between the proteins under consideration. For example, residues involved in substrate specificity for serine proteases may be conserved among orthologs, such as the chymotrypsins, but not between paralogs, such as chymotrypsins and trypsins. Conservation at buried positions • core residues, which are usually hydrophobic, often tolerate conservative substitutions, i.e. to other hydrophobics • overall core volume is well-conserved (Lim & Ptitsyn, 1970) though individual core positions tolerate variation in volume • this reflects what we know about packing and the effects of core mutations on stability--thus sequence conservation is partly related to maintaining a stable structure portion of alignment of prokaryotic and eukaryotic globins Y140 yellow = small neutral/polar green = hydrophobic red/pink = polar/acidic blue = basic buried H156 human hemoglobin beta chain Conservation at solvent-exposed positions • solvent-exposed (surface) positions are mutable and usually tolerate mutation to many residue types including hydrophobics. Bashford et al., however, noted that for globins at least, some surface positions do not tolerate large hydrophobics. Since polar-to-hydrophobic mutations on protein surfaces do not reduce stability, this conservation could reflect constraints on solubility. Indeed, it is clear that the overall polar character of the surface is conserved for soluble, globular proteins, even though a certain number of hydrophobics may be tolerated. Y140 yellow = small neutral/polar green = hydrophobic red/pink = polar/acidic blue = basic examples of surface residues H156 human hemoglobin beta chain Conservation of loops and turns • “Spacer” regions between secondary structures, such as loops and turns, are often hypermutable and vary not only in sequence but in length, tolerating insertion and deletion events (Insertions and deletions are much less often found within secondary structure elements. Why?) part of alignment of animal hemoglobin a and b chains human a chain Are the a and b chains related to each other by paralogy or orthology? Covariation analysis Substitution patterns at different positions in a sequence alignment are not necessarily independent. This is sometimes referred to as covariation or correlated evolution. name A B C D sequence YADLGRIKS YSDLGSEKE IDDFGEIAA IDDFGVIGT For example, in the mini multiple alignment shown at left, the identity of the residue at the 4th position is correlated to the identity of the residue at the 1st position. A statistical perturbation analysis can be used to characterize this covariation. An alignment of related sequences is “perturbed” by only considering sequences at which, for example, the first position is Y. The effect of this perturbation on the residue distribution observed at other positions is then measured. If the distribution changes significantly, covariation between sequence changes at the first site and other sites in the alignment is inferred. Covariation and hydrophobic core packing The hydrophobic core residues in related proteins tend to be covariant due to constraints on core packing. One sees compensatory volume changes at different positions. Davidson and coworkers found that for 266 aligned SH3 domain sequences, the strongest covariation was observed for a cluster of central hydrophobic residues. For example, substitution of a smaller residue (Ala->Gly) at 39 was strongly correlated to substitution of a larger residue (Ile->Phe) at 50. Hydrophobic core of SH3 domains, with most frequently covarying residues shown in yellow S.M. Larson, A.A. DiNardo and A.R. Davidson, J Mol Biol 303, 433 (2000) Some recent studies (Suel et al) have suggested a connection between covarying clusters of residues and transduction of signals between distant sites in proteins. For example, G-protein coupled receptors bind a ligand on one side of a membrane, and then transduce that signal to the other side through conformational change. Suel et al showed that the main clusters of covarying residues tended to connect the ligand and G-protein binding sites. ligand covarying networks (brown) membrane G-protein binding sites Suel et al. Nat Struct Biol 2003 Sequence logos: a common way of representing sequence conservation convert alignment to logo Note how the logo simplifies reading the conservation, but doesn’t throw away information like simply calculating a “consensus” sequence would. Logos represent sequence conservation in an easy to read format, with letter heights essentially representing the frequency with which a residue type occurs at a position in an alignment, relative to the frequency with which it would occur at random. The units of the y-axis are “bits” of information, which is to say that if a residue did not occur more often than expected at random, it would not offer us any information and the letter height would be zero. Note that the letter heights only become very high when a residue really dominates in the alignment, like Ala at the fifth position here. weblogo server: http://weblogo.berkeley.edu sequence logos paper: Schneider and Stevens, Nucleic Acids Res 18, 6097 (1990). Generalized substitution matrices Conservation patterns observed in related proteins have been used to construct generalized substitution matrices (e.g. BLOSUM, PAM, Gonnet) which reflect the gross average likelihood of a mutation occurring and being accepted in a protein. (handouts in class will have more notes on these) cysteine, tryptophan least mutable polar more mutable than hydrophobic. Polar more likely to be substituted by polar, hydrophobic by hydrophobic the PAM 250 matrix Inferring homology between proteins The simplest way of identifying homology is by sequence comparison. If two protein sequences are sufficiently similar (we’ll talk about what similarity means in a moment), they can be statistically inferred to be homologous. In addition, if a sequence obeys conservation patterns observed in a known family of related sequences, it can be inferred to be a member of that family. For sequences of statistically borderline similarity, structural and functional comparison, if such information is available, can be used as a supplement to establish common ancestry. If similarity between two sequences is really statistically weak, very strong structural and functional similarity can still make a convincing argument for homology. Finally, gene context can play a role--for example, do two genes occupy the same location within an operon in different organisms? We will next focus on identification of homology through sequence comparison. We will begin with simple pairwise comparison. Pairwise alignment of sequences--global and local GLOBAL ALIGNMENT F R T Y I A E W Q R T E P G A D H F Q T Y A A D Y - R T E P S S D H * * * * * * * * * * entire length of sequence aligned--about 60% identity over 17 residues. Note that allowance for gaps improves the % identity. The best alignment would be determined by using some optimization algorithm in combination with a scoring scheme, e.g. +1 for every identity and 0 for every mismatch or gap (identity matrix). - - - - - - - - - R T E P G A D H LOCAL ALIGNMENT - - - - - - - - - R T E P S S D H * * * * * * only the best matching portion(s) of sequence is (are) included in the alignment--75% percent identity over 8 residues. How does a local alignment algorithm decide where to stop? By lengthening the alignment only insofar as it increases the score. For example, one could increase the score by +2 for every identical amino acid, while assigning a penalty of -1 for every mismatch or gap. Such penalties would prevent the alignment from extending to dissimilar regions Pairwise alignment of sequences--global vs. local Local alignment is more versatile than global and is thus more widely used. It can be used to align proteins that are not related throughout their lengths but share a conserved domain, as well as proteins with very unevenly distributed sequence similarity. Many many such cases exist. Thus, when one has no prior knowledge of what to expect, local alignment routines are preferable. This will especially be the case if one is using pairwise alignment to search a database for sequences that are related to a query sequence. Thus, alignment algorithms for database searching essentially always use local alignment. It should be noted that the scoring scheme used can be tailored to favor longer or shorter local alignments. Global alignment is usually used to align sequences that are approximately the same length and are already known to be related. Once we’ve aligned all or part of a pair of sequences, how do we decide whether they are homologous? Percent sequence identity and homology Common rule-of-thumb: 30% identity indicates homology. This is too simplistic! high level of identity between unrelated proteins is common at short alignment lengths don’t worry about this 20-30% identity called the “twilight zone”: difficult to assess relatedness from identity from Brenner et al. PNAS 95, 6073 (1998) the 30% identity threshold for identification of homology only works for long alignments, i.e. >100-150 amino acids Sequence identity and homology: false positives sequence identity is 39% over 64 residues, yet the two proteins are unrelated--this would be a false positive using a 30% cutoff rule. Use of a length-dependent cutoff would help. Note also that gaps are allowed in this alignment--identity would be lower if gaps were not allowed. However, gaps are common among true homologs. from Brenner et al. PNAS 95, 6073 (1998) Sequence identity and homology: poor coverage the two proteins have the same fold,both bind heme and oxygen in same place: good independent structural/functional evidence for homology... Yet alignments of their sequences reveal only 24% identity. There are also many examples of related globins and other proteins with much lower identity than this. 1MBO and 1HBB hemoglobin and myoglobin Any reasonable sequence identity criterion, whether it is a flat percent cutoff or a length-dependent cutoff, will give incomplete coverage--in other words, it will fail to identify many distant but true relationships. “Sequence similarity” Sequence identity is one specific way of assessing sequence similarity, and it’s not a very good one. If you just use sequence identity, you are throwing away a lot of information. Specifically, as we have recently learned, not all mutations are equally likely to occur and be accepted during the course of evolution. Knowledge of what substitutions commonly occur among related proteins can be put to use in aligning sequences and identifying relationships. Various methods have been developed which use such knowledge to assess sequence similarity. The most widely used and familiar of these methods work by using generalized amino acid substitution matrices (aka scoring matrices) in tandem with effective local pairwise alignment algorithms. This is coupled with a statistical assessment of the significance of the alignment score obtained between two sequences using a given matrix. As we’ll see, it’s possible (and often worth doing) to get much more sophisticated than that, but that’s where it begins. Scoring alignments using substitution matrices would moving F over two positions to the left improve or worsen the score? G G 6 D A E R 2 -1 Y Y 7 M - M Q P 5 -11 -1 F L 0 R R 5 D W I D W G 6 11 -4 = gap extension penalty gap opening penalty substitution penalties are just elements from BLOSUM 62 matrix 25 overall score is sum of scores at each position In theory, the score is related to the odds that the alignment represents a an actual homologous relationship between two proteins. Because scoring matrices are in log-odds form, the overall alignment score is the sum of the scores at each position rather than the product. Common pairwise local alignment methods Smith-Waterman dynamic programming algorithm: See handout for description of dynamic programming. Mathematically guaranteed to find highest scoring alignment for a given set of input parameters. Tradeoff is that it is slow, although computer speed is getting to the point where this is less of a problem. The global version of Smith-Waterman is called Needleman-Wunsch. If one were simply comparing any 2 sequences to see if they are homologous, Smith-Waterman would be the method of choice. BLAST (Basic Local Alignment Search Tool) FASTA These two are very similar--both achieve a speed advantage over Smith-Waterman by initially looking for short “words” of 2 or 3 residues that (nearly) exactly match. Alignments are then built from these initial seed matches. The tradeoff for the speed advantage is that some homologies may be missed. Because of their speed, BLAST and FASTA are used in searches of large databases for homologues. Variables in local alignment-based search algorithms scoring matrix the generalized log odds substitution matrix used to score alignment--BLOSUM and PAM are the most commonly used. BLOSUM 62 is default on BLAST and BLOSUM 50 on most FASTA servers gap penalties gap opening penalty (for initiating a gap) gap extension penalty (adding elements to existing gap) “word size” (“ktup” parameter in FASTA). BLAST and FASTA are so fast partly because they start by looking for short “words” that match exactly and build up a longer alignment from these words. The size of the starting words can be varied with this parameter (the shorter the word the more it slows down the program) filter filters sequence to get rid of “low complexity” regions. Such regions can lead to false positives due to their compositional bias. Statistical significance of alignment scores: The extreme value distribution Raw alignment scores by themselves are not particularly meaningful. In order to assess the statistical significance of an alignment, i.e. the chances that it represents a real relationship, one must understand what the distribution of alignment scores would be for random pairs of sequences of similar length and composition. Such scores obey what is called an extreme value distribution, which is like a normal distribution but has a positively skewed tail. The exact characteristics of the distribution will depend upon the scoring matrix, the gap penalties employed, the composition of the sequences, etc. what is probability P that a random alignment will have example of extreme value a given score or distribution higher? # of occurrences vs. alignment score Altschul et al. Nucleic Acids Research 25, 3389 (1997) Statistical significance of alignment scores: Z-scores, P-values and E-values A Z-score is the number of standard deviations between the alignment score and the mean of a normal distribution. The FASTA algorithm reports Z scores in its output. A P-value is the probability that an alignment between two random sequences will have a score equal to or greater than the observed score, as calculated from the extreme value distribution. The E-value or expect value represents the number of times that the observed score or higher would be observed when searching a database of D sequences. For cases where P < 0.1, E ~ D*P. Both FASTA and BLAST report Evalues for alignments. Basically, to be confident that a match between two sequences represents true homology, you generally want an E-value < 0.01. That means there’s a 1 in 100 chance that you have a false positive. It has been shown (Brenner et al. 1998) that FASTA and BLAST Evalues do a pretty good job of distinguishing true and false positives. Sample BLAST output alignment >RCRO_BPP22 MYKKDVIDHFGTQRAVAKALGISDAAVSQWKEVIPEKDAYRLEIVTAGALKYQENAYRQAA score E-value GenBank identifier gi|4539473|emb|CAB39982.1| (AJ237660) Cro protein [Bacterio... gi|12515040|gb|AAG56161.1|AE005346_8 (AE005346) unknown pro... gi|13361674|dbj|BAB35631.1| (AP002557) putative regulatory ... gi|12514991|gb|AAG56121.1|AE005343_11 (AE005343) putative r... gi|118633|sp|P06965|DICC_ECOLI REPRESSOR PROTEIN OF DIVISIO... gi|6093941|sp|Q37907|RCRO_BPD3 REGULATORY PROTEIN CRO (ANTI... gi|9635583|ref|NP_061566.1| Cro [Pseudomonas phage D3] >gi|... gi|118631|sp|P06966|DICA_ECOLI REPRESSOR PROTEIN OF DIVISIO... gi|13559845|ref|NP_112055.1| cII [Bacteriophage HK620] >gi|... gi|7531033|sp|O84102|ACPS_CHLTR HOLO-[ACYL-CARRIER PROTEIN]... 100 47 47 42 40 33 32 32 29 29 4e-21 4e-05 4e-05 0.001 0.005 0.46 0.86 1.5 8.9 9.3 >gi|12515040|gb|AAG56161.1|AE005346_8 (AE005346) unknown protein encoded within prophage CP-933O[Escherichia coli O157:H7 EDL933] Length = 84 Score = 47.0 bits (110), Expect = 4e-05 Identities = 25/53 (47%), Positives = 32/53 (60%) Query: 1 Sbjct: 5 “positives” means positions at which scoring matrix element is positive MYKKDVIDHFGTQRAVAKALGISDAAVSQWKEVIPEKDAYRLEIVTAGALKYQ 53 M K +V+ +FG A ALG S VS W E +P K A ++ VTAGALKY+ MKKSEVLGYFGGVVKTAAALGTSKTTVSMWGEDVPWKWALLIQAVTAGALKYE 57 percent positives is sometimes also called “percent similarity” BLAST and FASTA can identify some homologues in the “twilight zone”--20 to 30% identity Score = 43.5 bits (101), Expect = 0.001 Identities = 36/145 (24%), Positives = 56/145 (37%), Gaps = 2/145 (1%) Query: 2 Sbjct: 4 Query: 62 Sbjct: 62 LSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDL 61 L+ E V +W KV D G + L RL +P T F+ F L T + + + LTPEEKSAVTALWGKVNVDEVGG--EALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV 61 KKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPG 121 K HG VL A L + + L++ H K + + + ++ VL KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK 121 Query: 122 DFGADAQGAMNKALELFRKDIAAKY 146 +F Q A K + +A KY Sbjct: 122 EFTPPVQAAYQKVVAGVANALAHKY 146 BLAST alignment of hemoglobin and myoglobin Even though sequence identity here is low, the E-value is statistically significant Pairwise alignments will not detect all homologies Although pairwise local alignment routines employing generalized substitution matrices do a good job of avoiding false positives, their coverage is imperfect (though it’s better than just using % identity). That is, there will be many relationships that they will miss because the sequences have diverged too far. big challenge in bioinformatics is finding these “remote homologs” homology identified independently in this trial database by structural/functional similarity EPQ means errors per query, ideally like E < 0.01 (1 in 100 chance of false positive) from Brenner et al. PNAS 95, 6073 (1998) SSEARCH is a program employing the Smith-Waterman algorithm Multiple alignment of sequences Conservation patterns observed in families of homologous sequences carry much more useful information than do single sequences, both from the point of view of understanding structure and function for a family, as well as for identifying whether a particular sequences is homologous to a particular family. Obtaining this information depends upon the ability to generate alignments of multiple related sequences: We aren’t going to have time to talk about methods for multiple alignment. Some of the better known methods/websites, such as ClustalX for global multiple alignment, will be listed as links on the course website. I recommend Chapter 4 of David Mount’s Bioinformatics for thorough coverage of the topic. We’re going to focus instead on what one can do with multiply aligned sequences. Position-dependent scoring matrices or “profiles” of sequence families can be generated from multiple alignments row in matrix is constructed by weighting a generalized substitution matrix by the appearance of the different amino acids in the alignment. For example, this row might be made from an equal weight of the E, G, V and L columns in, say, a PAM250 matrix. Gribskov, McLachlan & Eisenberg, PNAS 84, 4355, 1987 The resulting matrix contains position-dependent information about sequence conservation within a particular family of sequences, as opposed to a generalized scoring matrix, which is constructed by averaging general sequence conservation tendencies among many families of related sequences Examples of models generated from multiple alignments profiles these two are almost the same thing position-specific scoring matrices (PSSM) hidden Markov models (HMM) These models can be generated for lengthy sequences or for short ungapped conserved regions (blocks or motifs) PSI-BLAST (Position-Specific Iterated) query sequence initial BLAST search utility obviously depends on getting some seed hits Altschul et al. Nucleic Acids Research 25, 3389 (1997) hits with significant similarity (e.g. E < 0.005) multiple alignment of hits PSSM iterated BLAST search using the PSSM as query the utility of PSI-BLAST in finding more remote homologues than simple pairwise searches has been demonstrated. An example of a similar program that uses a Hidden Markov model rather than a PSSM is SAM-T99 (now SAM-T02) Example of utility of PSI BLAST initial BLAST with cutoff of E <0.01 brings up only BRCT domains from other BRCA1s (orthologues) two BRCT domains from BRCA1 used as query few false positives were found using E<0.01 cutoff repeated rounds of PSI-BLAST bring up many others and reveal first plant protein to contain BRCTs Altschul et al. Nucleic Acids Research 25, 3389 (1997) Searching profile databases query sequence database of HMMs, PSSMs A number of researchers have used similarity searches to cluster the known proteins into homologous groups, and then generated profiles for each cluster using HMMs or PSSMs. Servers now allow one to do similarity searches of these database profiles using a single query sequence. This is qualitatively the reverse of what is done in PSIBLAST, in which one generates a profile and uses it to match individual database sequences. Some of these profiles represent motifs or short ungapped “blocks”, whereas others are the length of entire domains. Among the best known collections of domain profiles are SMART and Pfam. These two form part of what is now called the Conserved Domain Database (CDD). BLAST searches with the NCBI server will now automatically do a search against the CDD unless you opt not to.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Sequence identity and homology