Homology and sequence alignment

Homology
Homology = similarity between objects due to common ancestry.
Hund = Dog, Schwein = Pig

Sequence homology
Similarity between sequences as a result of common ancestry.

    VLSPAVKWAKVGAHAAGHG
    ||| || |||| |  ||||
    VLSEAVLWAKVEADVAGHG

Sequence alignment
Alignment: comparing two (pairwise) or more (multiple) sequences by searching for a series of identical or similar characters in the sequences.

Why align?

    VLSPAVKWAKV
    ||| || ||||
    VLSEAVLWAKV

1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).
2. Required for evolutionary studies (e.g., tree reconstruction).
3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

Sequence alignment
If two sequences share a common ancestor, for example human and dog hemoglobin, we can represent their evolutionary relationship using a tree whose leaves are the two aligned sequences:

    VLSPAV-WAKV
    ||| || ||||
    VLSEAVLWAKV

Perfect match
A perfect match suggests that no change has occurred since the common ancestor (although this is not always the case).

A substitution
A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it occurred).

Indel
Option 1: The ancestor (VLSEAVLWAKV) had the L and it was lost in one lineage. In such a case, the event was a deletion.

Indel
Option 2: The ancestor (VLSEAVWAKV) was shorter and the L was inserted in one lineage. In such a case, the event was an insertion.

Indel
Normally, given two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.

    VLSPAV-WAKV    Deletion?
    VLSEAVLWAKV    Insertion?

Global vs. local
• Global alignment: finds the best alignment across the entire length of the two sequences.

    ADLGAVFALCDRYFQ
    ||||     |||| |
    ADLGRTQN-CDRYYQ

• Local alignment: finds regions of similarity in parts of the sequences.

    ADLG    CDRYFQ
    ||||    |||| |
    ADLG    CDRYYQ

Global alignment forces alignment in regions that differ.
Local alignment returns only regions of good alignment.

Global alignment
[Figure: global alignment of PTK2 (protein tyrosine kinase 2) of human and rhesus monkey]

Proteins are composed of domains
Human PTK2 contains Domain A, Domain B, and a protein tyrosine kinase domain.

Protein tyrosine kinase domain
In leukocytes, a different gene for tyrosine kinase is expressed. It contains Domain X and a protein tyrosine kinase domain.

The sequence similarity is restricted to a single domain
PTK2: Domain A, Domain B, protein tyrosine kinase domain
Leukocyte TK: Domain X, protein tyrosine kinase domain

[Figure: global alignment of PTK and LTK]

[Figure: local alignment of PTK and LTK]

Conclusions
Use global alignment when the two sequences share the same overall sequence arrangement.
Use local alignment to detect regions of similarity.

How alignments are computed

Pairwise alignment

    AAGCTGAATTCGAA
    AGGCTCATTTCTGA

One possible alignment:

    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-

This alignment includes:
• 2 mismatches
• 4 indels (gaps)
• 10 perfect matches

Choosing an alignment for a pair of sequences
Many different alignments are possible for 2 sequences:

    AAGCTGAATTCGAA
    AGGCTCATTTCTGA

    A-AGCTGAATTC--GAA
    AG-GCTCA-TTTCTGA-

    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-

Which alignment is better?
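The tally of matches, mismatches, and indels for a given alignment can be reproduced by classifying each column in turn. The following is a minimal Python sketch; the function name `count_columns` is made up for illustration and is not part of the lecture.

```python
def count_columns(top, bottom):
    """Classify each column of a pairwise alignment as match, mismatch, or indel."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            counts["indel"] += 1      # a gap character on either row
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The two candidate alignments shown on the slides.
first = count_columns("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = count_columns("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first)   # 10 matches, 2 mismatches, 4 indels
print(second)  # 9 matches, 2 mismatches, 6 indels
```

Counting columns this way says nothing yet about which alignment is preferable; that requires attaching a score to each column type.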
Scoring system (naïve)
Perfect match: +1
Mismatch: -2
Indel (gap): -1

    AAGCTGAATT-C-GAA
    AGGCT-CATTTCTGA-
    Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

    A-AGCTGAATTC--GAA
    AG-GCTCA-TTTCTGA-
    Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → better alignment

Alignment scoring: scoring of sequence similarity
• Assumes independence between positions: each position is considered separately
• Scores each position: positive if identical (match), negative if different (mismatch or gap)
• Total score = sum of position scores
• The total can be positive or negative

Scoring system
• In the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary
• Different scoring systems → different alignments
• We want a good scoring system…

DNA scoring matrices
Can take into account biological phenomena such as:
• Transition-transversion

Amino-acid scoring matrices
• Take into account physico-chemical properties

Scoring gaps (I)
In advanced algorithms, two gaps of one amino acid each are given a different score than one gap of two amino acids. This is done by charging a penalty for each gap that is opened, in addition to a smaller penalty for each position by which it is extended.
Gap extension penalty < Gap opening penalty

Homology versus chance similarity
How to check if the score is significant?
A. Take the two sequences and compute their alignment score.
B. Take one of the sequences, shuffle it randomly, and compute its score against the second sequence. Repeat 100,000 times.
If the score in A is in the top 5% of the scores in B, the similarity is significant.

How close?
Rule of thumb:
• Proteins are homologous if they are at least 25% identical (length > 100)
• DNA sequences are homologous if they are at least 70% identical

Twilight zone
• < 25% identity in proteins: the sequences may or may not be homologous.
• (Note that 5% identity will be obtained completely by chance!)
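The naive scoring system and the shuffling test for significance can be combined in a short Python sketch. The slides do not say which algorithm computes the best score, so the dynamic-programming routine below (Needleman-Wunsch with a linear gap cost) is an assumption, as are the names `global_score` and `shuffle_test`. The default of 200 shuffles is chosen only to keep the example fast; the slide suggests 100,000.

```python
import random


def global_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best global alignment score of a and b (score only, linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_test(a, b, n_shuffles=200, seed=0):
    """Return the observed score and the fraction of shuffled scores >= it."""
    rng = random.Random(seed)        # fixed seed so the run is repeatable
    observed = global_score(a, b)
    letters = list(b)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)         # same composition, random order
        if global_score(a, "".join(letters)) >= observed:
            at_least += 1
    return observed, at_least / n_shuffles


observed, fraction = shuffle_test("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print(observed, fraction)
# Following the slide's criterion, a fraction of 0.05 or less would be
# read as a significant similarity.
```

Shuffling keeps the length and letter composition of the sequence fixed, so the comparison isolates the contribution of letter order to the score.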
Searching a sequence database
Idea: to find sequences homologous to a sequence of interest, compute its pairwise alignment against all known sequences in a database and report the best-scoring significant homologs.
The same idea in short: use your sequence as a query to find homologous sequences in a sequence database.

Some terminology
• Query sequence: the sequence with which we are searching
• Hit: a sequence found in the database, suspected to be homologous

Query sequence: DNA or protein?
• For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.
• Which is preferable?

Protein is better!
• Selection (and hence conservation) works (mostly) at the protein level:

    CTT TCA = Leu-Ser
    TTG AGT = Leu-Ser

The two DNA sequences differ at five of their six positions, yet they encode the same dipeptide.

Query type
• Nucleotides: a four-letter alphabet
• Amino acids: a twenty-letter alphabet
• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity

Conclusion
The amino-acid sequence is often preferable for homology search.

How do we search a database?
• If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.5 days to complete one search.
• 150,000 searches (at least!!) are performed per day.
• > 82,000,000 sequence records in GenBank.

Conclusion
• Running the exact pairwise alignment algorithm between the query and every database entry is too slow.

Heuristic
• Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (though the result is usually not too bad) but takes much less time than the exact algorithm.

BLAST
• BLAST: Basic Local Alignment Search Tool
• A heuristic for searching a database for similar sequences

DNA or protein
• All types of searches are possible: the query can be DNA or protein, and the database can be DNA or protein.
• blastn: nucleotide query vs. nucleotide database
• blastp: protein query vs. protein database
• blastx: translated query vs. protein database
• tblastn: protein query vs. translated nucleotide database
• tblastx: translated query vs. translated database
Translated databases: trEMBL, genPept

BLAST: underlying hypothesis
• The underlying hypothesis: when two sequences are similar, there are short ungapped regions of high similarity between them
• The heuristic:
1. Discard irrelevant sequences
2. Perform exact local alignment only with the remaining sequences

How do we discard irrelevant sequences quickly?
• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)
• Save the words in a look-up table that can be searched quickly

    WTDFGYPAILKGGTAC
    WTD  TDF  DFG  FGY  GYP  …

BLAST: discarding sequences
• When the user enters a query sequence, it is also divided into words
• Search the database for consecutive neighboring words

Neighbor words
• Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level

    GFB    GFC (20)    GPC (11)    WAC (5)

E-value
• The number of alignments with a score ≥ Y that we would theoretically expect to find when searching a random sequence against a random database.
Theoretically, we could trust any result with an E-value ≤ 1.
In practice, BLAST uses estimations:
• E-values of 10^-4 and lower indicate a significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate a good homology.

Web servers for pairwise alignment
BLAST 2 sequences (bl2seq) at NCBI
• Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment
• Does not use an exact algorithm but a heuristic

[Figure: NCBI BLAST page, bl2seq]
[Figure: bl2seq query form; blastn for nucleotide input, blastp for protein input]
[Figure: bl2seq results, annotated with match, gaps, similarity, dissimilarity, and low complexity]
[Figure: BLAST programs; query: DNA or protein, database: DNA or protein]
[Figure: blastp query form and results]

BLAST scores:
• Bit score: a score for the alignment according to the number of similarities, identities, etc.
• Expected score (E-value): the number of alignments with the same score one can "expect" to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.

[Figure: blastp, acquiring sequences]

Multiple sequence alignment

    VTISCTGSSSNIGAG-NHVKWYQQLPG
    VTISCTGTSSNIGS--ITVNWYQQLPG
    LRLSCSSSGFIFSS--YAMYWVRQAPG
    LSLTCTVSGTSFDD--YYSTWVRQPPG
    PEVTCVVVDVSHEDPQVKFNWYVDG--
    ATLVCLISDFYPGA--VTVAWKADS--
    AALGCLVKDYFPEP--VTVSWNSG---
    VSLTCLVKGFYPSD--IAVEWWSNG--

Similar to pairwise alignment, BUT n sequences are aligned instead of just 2.
MSA = Multiple Sequence Alignment
Each row represents an individual sequence.
Each column represents the 'same' position.

Conserved positions
• Columns in which all the sequences contain the same amino acids or nucleotides
• Important for the function or structure

    VTISCTGSSSNIGAG-NHVKWYQQLPG
    VTISCTGSSSNIGS--ITVNWYQQLPG
    LRLSCTGSGFIFSS--YAMYWYQQAPG
    LSLTCTGSGTSFDD-QYYSTWYQQPPG

Consensus sequence
A consensus sequence holds the most frequent character of the alignment at each column.

    A T C T T G T
    A A C T T G T
    A A C T T C T
    -------------
    A A C T T G T    (consensus)

Profile = PSSM = Position Specific Score Matrix

    A T C T T G
    A A C T T G
    A A C T T C

         1     2     3     4     5     6
    A    1     0.67  0     0     0     0
    C    0     0     1     0     0     0.33
    G    0     0     0     0     0     0.67
    T    0     0.33  0     1     1     0

Alignment methods
No practical exact (optimal) algorithm is available for MSA, so all methods are heuristics:
• Progressive/hierarchical alignment (Clustal)
• Iterative alignment (MAFFT, MUSCLE)

Progressive alignment
Sequences: A, B, C, D, E
First step: compute the pairwise alignments for all against all (10 pairwise alignments for 5 sequences). The similarities are converted to distances and stored in a table.

         A    B    C    D
    B    8
    C   15   17
    D   16   14   10
    E   32   31   31   32

Second step: cluster the sequences to create a tree (guide tree). In the guide tree:
• the branching represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!

Third step:
1. Align the most similar (neighboring) pairs (sequence vs. sequence).
2. Align pairs of pairs (profile vs. profile).
[Figure: guide tree for A to E, with the last alignment labeled profile vs. sequence]

Main disadvantages:
• Sub-optimal tree topology
• Misalignments resulting from globally aligning pairs of sequences

Iterative alignment
Sequences A to E → pairwise distance table → guide tree → MSA
Iterate until the MSA does not change (convergence).

Case study: using homology searching
• The human kinome

Kinases and phosphatases

Multi-tasking enzymes
• Signal transduction
• Metabolism
• Transcription
• Cell cycle
• Differentiation
• Function of the nervous and immune systems
• …
• And more

How many kinases in the human genome?
• 1950s: discovery that reversible phosphorylation regulates the activity of glycogen phosphorylase
• 1970s: the advent of cloning and sequencing led to speculation that the vertebrate genome encodes as many as 1,001 kinases

How many kinases in the human genome?
• 2001: human genome sequence …
• In addition: the GenBank, Swiss-Prot, and dbEST databases
• How can we find out how many kinases are out there?

The human kinome
In 2002, Manning, Whyte, Martinez, Hunter and Sudarsanam set out to:
1. Search and cross-reference all these databases for all kinases
2. Characterize all the kinases found

ePKs and aPKs
• Eukaryotic protein kinases (ePKs, the majority): sequence homology in the catalytic domain; additional regulatory domains are non-homologous
• Atypical protein kinases (aPKs): no sequence homology to ePKs; some aPK subfamilies have structural similarity to ePKs

The search
• Several profiles were built, based on the catalytic domain of:
(a) 70 known ePKs from yeast, worm, fly, and human with > 50% identity in the ePK domain
(b) each subfamily of known aPKs
• HMM-profile searches and PSI-BLAST searches were performed

The results…
• 478 ePKs
• 40 aPKs
• Total of 518 kinases in the human genome (about half of the 1970s prediction) [1.7% of human genes]
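The profiles used in the kinome search build on the per-column frequency idea introduced with the consensus and PSSM slides. As a reminder of that idea, here is a minimal Python sketch that builds a frequency profile and a consensus from the three toy DNA rows of the profile slide. The function names are made up for illustration, and a plain frequency table is a much simpler object than the HMM profiles and PSI-BLAST matrices used in the actual study.

```python
from collections import Counter


def build_profile(rows, alphabet="ACGT"):
    """Per-column character frequencies of a gap-free alignment."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({ch: counts[ch] / len(rows) for ch in alphabet})
    return profile


def consensus(profile):
    """Most frequent character in each column (ties go to alphabet order)."""
    return "".join(max(col, key=col.get) for col in profile)


rows = ["ATCTTG", "AACTTG", "AACTTC"]   # toy alignment from the profile slide
profile = build_profile(rows)
for ch in "ACGT":
    print(ch, [round(col[ch], 2) for col in profile])
print(consensus(profile))
```

Each printed row corresponds to one row of the slide's matrix, with one entry per alignment column.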