* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Lesson 2 - The George S. Wise Faculty of Life Sciences
Promoter (genetics) wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
G protein–coupled receptor wikipedia , lookup
Expanded genetic code wikipedia , lookup
Gene expression wikipedia , lookup
Biochemistry wikipedia , lookup
Western blot wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Protein (nutrient) wikipedia , lookup
Protein moonlighting wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Protein adsorption wikipedia , lookup
Genetic code wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Non-coding DNA wikipedia , lookup
Proteolysis wikipedia , lookup
Mitogen-activated protein kinase wikipedia , lookup
Protein structure prediction wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Molecular evolution wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Lessons 3-4 Sequence homology and alignment 1 Homology Similarity between characters due to a common ancestry 2 Sequence homology Similarity between sequences that results from a common ancestor VLSPAVKWAKVGAHAAGHG VLSEAVLWAKVEADVAGHG Basic assumption: Sequence homology → similar structure/function 3 Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences. 4 Homology Ortholog – homolog with similar function (via speciation) Paralog – homolog which arose from gene duplication Common use: Orthologs – 2 homologs from different species Paralogs – 2 homologs within the same species 5 How close? Rule of thumb: Proteins are homologous if 25% identical (length >100) DNA sequences are homologous if 70% identical 6 Twilight zone < 20% identity in proteins – may be homologous and may not be…. (Note that 5% identity will be obtained completely by chance!) 7 Why sequence alignment? Predict characteristics of a protein – use the structure/function of known proteins for predicting the structure/function of an unknown protein 8 Sequence modifications Sequences change in the course of evolution due to random mutations Three types of mutations: 1. 2. 3. Insertion - an insertion of a letter or several letters to the sequence. AAGA AAGTA Deletion - deleting a letter (or more) from the sequence. AAGA AGA Substitution - replacing a sequence letter by another. AAGA AACA Insertion or Deletion = Indel 9 Local vs. Global Global Global alignment: forces alignment in regions which differ alignment – finds the best alignment across the entire two sequences. ADLGAVFALCDRYFQ |||| |||| | ADLGRTQN-CDRYYQ Local Local alignment will return only regions of good alignment alignment – finds regions of similarity in parts of the sequences. ADLG |||| ADLG CDRYFQ |||| | CDRYYQ 10 When global and when local? 11 Global alignment PTK2 protein tyrosine kinase 2 of human and rhesus monkey 12 Protein tyrosine kinase domain 13 Protein tyrosine kinase domain Human PTK2 and leukocyte tyrosine kinase Both function as tyrosine kinases, in completely different contexts Ancient duplication 14 Global alignment of PTK and LTK 15 Local alignment of PTK and LTK 16 Pairwise alignment AAGCTGAATTCGAA AGGCTCATTTCTGA One possible alignment: AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- 17 AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- This alignment includes: 2 mismatches 4 indels (gap) 10 perfect matches 18 Choosing an alignment: Many different alignments are possible: AAGCTGAATTCGAA AGGCTCATTTCTGA A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGA- AAGCTGAATT-C-GAA AGGCT-CATTTCTGA- Which alignment is better? 19 Alignment scoring - scoring of sequence similarity: Assumes independence – Each position considered separately Score at each position – Positive if identical – Negative if different or gap Score of an alignment is the sum of position scores – Can be positive or negative 20 Example - naïve scoring system: Perfect match: +1 Mismatch: -2 Indel (gap): -1 AAGCTGAATT-C-GAA AGGCT-CATTTCTGAScore: = (+1)x10 + (-2)x2 + (-1)x4 = 2 A-AGCTGAATTC--GAA AG-GCTCA-TTTCTGAScore: = (+1)x9 + (-2)x2 + (-1)x6 = -1 Higher score Better alignment 21 Scoring system: The choice of +1,-2, and -1 scores is quite arbitrary Different scoring systems different alignments Scoring systems implicitly represent a particular theory of evolution – Some mismatches are more plausible Transition vs. Transversion LysArg ≠ LysCys – Gap extension Vs. Gap opening 22 Scoring matrix Representing the scoring system as a table or matrix n n (n is the number of letters the alphabet contains. n=4 for nucleotides, n=20 for amino acids) symmetric A G C A 2 G -6 2 C -6 -6 2 T -6 -6 -6 T 2 23 DNA scoring matrices Uniform substitution in all nucleotides: From To A A 2 G -6 2 C -6 -6 2 T -6 -6 -6 Mismatch G C Match T 2 24 DNA scoring matrices Can take into account biological phenomena such as: Transition-transversion 25 Amino Acid Scoring Matrices Take into account physico-chemical properties 26 Amino Acid Substitutions Matrices Actual substitutions: – Based on empirical data – Commonly used by many bioinformatic programs – PAM & BLOSUM 27 Protein matrices – actual substitutions The idea: Given an alignment of a large number of closely related sequences we can score the relation between amino acids based on how frequently they substitute each other M M M M M M M M G G G G G G G G Y Y Y Y Y Y Y Y D D E D Q D E E E E E E E E E E In the fourth column E and D are found in 7 / 8 28 PAM Matrix - Point Accepted Mutations Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity) – Alignment was easy Counted the number of the substitutions per amino acid pair (20 x 20) Found that common substitutions occurred between chemically similar amino acids 29 PAM Matrices Family of matrices PAM 80, PAM 120, PAM 250 The number on the PAM matrix represents evolutionary distance Larger numbers are for larger distances 30 Example: PAM 250 Similar amino acids have greater score 31 PAM - limitations Only one original dataset Examining proteins with few differences (85% identity) Based mainly on small globular proteins so the matrix is biased 32 BLOSUM Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset BLOSUM observes significantly more replacements than PAM, even for infrequent pairs 33 BLOSUM: Blocks Substitution Matrix Based on BLOCKS database – ~2000 blocks from 500 families of related proteins – Families of proteins with identical function AABCDA----BBCDA Blocks are short conserved patterns of 3-60 aa long without gaps DABCDA----BBCBB BBBCDA-AA-BCCAA AAACDA-A--CBCDB CCBADA---DBBDCC AAACAA----BBCCC 34 BLOSUM Each block represent sequences alignment with different identity percentage For each block the amino-acid substitution rates were calculated to create BLOSUM matrix 35 BLOSUM Matrices BLOSUMn is based on sequences that shared at least n percent identity BLOSUM62 represents closer sequences than BLOSUM45 36 Example : Blosum62 derived from block where the sequences clustered share at least 62% identity 37 PAM vs. BLOSUM PAM100 PAM120 PAM160 PAM200 PAM250 = = = = = BLOSUM90 BLOSUM80 BLOSUM60 BLOSUM52 BLOSUM45 More distant sequences 38 Scoring system = substitution matrix + gap penalty 39 Gap penalty We penalize gaps Scoring for gap opening & for extension: – Gap-extension penalty < gap-open penalty 40 Optimal alignment algorithms Needleman-Wunsch (global) Smith-Waterman (local). 41 Alignment Search Space The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences, n. If n=100, there are 100100 = 10200 = 100,000,000,000,000,000,000,000,000,000,000,000,000,000,0 00,000,000,000,000,000,000,000,000,000,000,000,000,000,00 0,000,000,000,000,000,000,000,000,000,000,000,000,000,000, 000,000,000,000,000,000,000,000,000,000,000,000,000,000,0 00,000,000,000,000,000,000,000,000,000 different alignments. Average protein length is about n=250! 42 Searching databases 43 Searching a database Using a sequence as the query to find homologous sequences in the database 44 DNA or protein? For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. Which is preferable? 45 Protein is better! Selection (and hence conservation) works on the protein level: CTTTCA = TTGAGT = Leu-Ser Leu-Ser 46 Query type Nucleotides: a four letter alphabet Amino acids: a twenty letter alphabet • Two random DNA sequences will share on average 25% of identity • Two random protein sequences will share on average 5% of identity 47 Conclusions Using the amino acid sequence is preferable for homology search Why use a nucleotide sequence after all? No ORF found, e.g. newly sequenced genome No similar protein sequences were found Specific DNA databases are available (EST) 48 Some terminology Query sequence - the sequence with which we are searching Hit – a sequence found in the database, suspected as homologous 49 How do we search a database? Assume we perform pairwise alignment of the query against all the sequences in the database Exact pairwise alignment is O(mn) ≈ O(n2) (m – length of sequence 1, n – length of sequence 2) 50 How much time will it take? O(n2) computations per search. Assume n=200, so we have 40,000 computations per search Size of database - ~60 million entries 2.4 x 1012 computations for each sequence search we perform! Assume each computation takes 10-6 seconds 24,000 seconds ≈ 6.66 hours for each sequence search 150,000 searches (at least!!) are performed per day 51 Conclusion Using the exact comparison pairwise alignment algorithm between query and all DB entries – too slow 52 Heuristic Definition: a heuristic is a design to solve a problem that does not provide exact solution (but is not too bad) and reduces the time complexity of the exact solution 53 BLAST BLAST - Basic Local Alignment and Search Tool A heuristic for searching a database for similar sequences 54 DNA or Protein All types of searches are possible. Query: DNA Protein Database: DNA Protein blastn – nuc vs. nuc blastp – prot vs. prot blastx – translated query vs. protein database tblastn – protein vs. translated nuc. DB tblastx – translated query vs. translated database Translated databases: trEMBL genPept 55 BLAST - underlying hypothesis 1. 2. The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between the two The heuristic: Discard irrelevant sequences Perform exact local alignment with remaining sequences 56 How do we discard irrelevant sequences quickly? Divide the database into words of length w (w = 3 for protein and w = 7 for DNA) Save the words in a look-up table that can be searched quickly WTDFGYPAILKGGTAC WTD TDF DFG FGY GYP … 57 BLAST: discarding sequences When the user gives a query sequence, divide it also into words Search the database for consecutive neighbor words 58 Neighbour words neighbor words are defined according to a scoring matrix (e.g. BLOSUM62 for proteins) with a certain cutoff level GFB GFC (20) GPC (11) WAC (5) 59 Search for consecutive words Neighbor word Look for a seed: hits on the same diagonal which can be connected Database record At least 2 hits on the same diagonal with distance which is smaller than a predetermined cutoff Query This is the filtering stage – many unrelated hits are filtered, saving lots of time! 60 Try to extend alignment Stop extending when the score of the alignment drops X beneath maximal score obtained this far Discard segments with score < S ASKIOPLLWLAASFLHNEQAPALSDAN JWQEOPLWPLAASOIHLFACNSIFYAS Score=15 X=4 Score=17 Score=14 62 The result – local alignment The result of BLAST will be a series of local alignments between the query and the different hits found 63 E-value The number of times we will theoretically find an alignment with a score ≥ Y of a random sequence vs. a random database Theoretically, we could trust any result with an E-value ≤ 1 In practice – BLAST uses estimations. E-values of 10-4 and lower indicate a significant homology. E-values between 10-4 and 10-2 should be checked (similar domains, maybe non-homologous). E-values between 10-2 and 1 are suspicious… 64 Filtering low complexity Low complexity regions : e.g., Proline rich areas (in protein), Alu repeats (in DNA) Regions of low complexity generate high score of alignment BUT – this does not indicate homology 65 Solution In BLAST there is masking of lowcomplexity regions in the query sequence (such regions are represented as XXXXX in query) 66 Multiple sequence alignment 67 VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG LSLTCTVSGTSFDD--YYSTWVRQPPG PEVTCVVVDVSHEDPQVKFNWYVDG-ATLVCLISDFYPGA--VTVAWKADS-AALGCLVKDYFPEP--VTVSWNSG--VSLTCLVKGFYPSD--IAVEWWSNG-- Like pairwise alignment BUT compare n sequences instead of 2 Rows represent individual sequences Columns represent ‘same’ position May be gaps in some sequences 68 MSA & Evolution MSA can give you a picture of the forces that shape evolution! Important amino acids or nucleotides are not “allowed” to mutate Less important positions change more easily 69 Conserved positions Columns where all the sequences contain the same amino acids or nucleotides Important for the function or structure VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGSSSNIGS--ITVNWYQQLPG LRLSCTGSGFIFSS--YAMYWYQQAPG LSLTCTGSGTSFDD-QYYSTWYQQPPG 70 Consensus Sequence The consensus sequence holds the most frequent character of the alignment at each column A T C T T G T A A C T T G T A A C T T C T A A C T T G T 71 Profile 1 2 3 4 5 6 A T C T T G T A 1 0.67 0 0 . . A A C T T G T T 0 0.33 1 1 . . A A C T T C T C 0 0 0 0 . . G 0 0 0 0 . . Profile = PSSM – Position Specific Score Matrix 72 Alignment methods Progressive alignment (Clustal) Iterative alignment (mafft, muscle) All methods today are an approximation strategy (heuristic algorithm), yield a possible alignment, but not necessarily the best one 73 Progressive alignment First step: A B C D Compute the pairwise alignments for all against all (6 pairwise alignments) the similarities are stored in a table A B C D A B 11 C 3 1 D 2 2 10 74 Second step: A B C D A cluster the sequences to create a tree (guide tree): B 11 C 3 1 D 2 2 10 guide •Represents the order The in which pairs of tree is imprecise sequences are to be aligned and is NOT the tree which •similar sequences are neighbors in the truly describes the tree •distant sequences are distant from each A relationship between the other in the tree sequences! B C D 75 Third step: Align most similar pairs A B C D Align the alignments as if each of them was a single sequence (replace with a single consensus sequence or use a profile) 76 Alignment of alignments M Q T F L H T W L Q S W X Y M L L L M Q H Q - T T S T T I I F W W F W L T I F M T I W 77 Iterative alignment A B C D Pairwise distance table A B C A Guide tree B 11 C 3 1 D 2 2 D Iterate until the MSA doesn’t change 10 MSA A B C D 78 Searching for remote homologs Sometimes BLAST isn’t enough. Large protein family, and BLAST only gives close members. We want more distant members PSI-BLAST Profile HMMs 79 PSI-BLAST Position Specific Iterated BLAST Regular blast Construct profile from blast results Blast profile search Final results 80 PSI-BLAST Advantage: PSI-BLAST looks for seq.s that are close to ours, and learns from them to extend the circle of friends Disadvantage: if we found a WRONG sequence, we will get to unrelated sequences (contamination). This gets worse and worse each iteration 81 Profile HMM Similar to PSI-BLAST: also uses a profile Takes into account: Dependence among sites (if site n is conserved, it is likely that site n+1 is conserved part of a domain The probability of a certain column in an alignment 82 PSI BLAST vs profile HMM PSI BLAST Profile HMM Less exact More exact Faster Slower 83 Case study: Using homology searching The human kinome 84 Kinases and phosphatases 85 Multi-tasking enzymes Signal transduction Metabolism Transcription Cell-cycle Differentiation Function of nervous and immune system … And more 86 How many kinases in the human genome? 1970’s, advent of cloning and sequencing Predicted: 1000 kinases in the human genome 87 How many kinases in the human genome? 2001 – human genome sequence … As well – databases of Genbank, Swissprot, and dbEST How do we know how many kinases are out there? 88 The human kinome 1. 2. In 2002, Manning, Whyte, Martinez, Hunter and Sudarsanam set out to: Search and cross-reference all these databases for all kinases Characterize all found kinases 89 ePKs and aPKs Eukaryotic protein kinases (majority) Sequence homology of the catalytic domain; additional regulatory domains are non-homologous Atypical protein kinases No sequence homology to ePKs; some aPK subfamilies have structural 90 similarity to ePKs The search Several profiles were built: based on the catalytic domain of (a) 70 known ePKs (b) each subfamily of known aPKs HMM-profile searches and PSI-BLAST searches were performed 91 The result… 478 apKs 40 ePKs Total of 518 kinases in the human genome (half of the prediction in the 1970’s) 92 Classifying the kinases 1. 2. Classification based on the catalytic domain Classification based on the regulatory domains 189 sub-families of kinases 93 Comparison to other species 209 subfamilies of ePKs in human, worm, yeast and fly 94 The human genome has x2 kinases (in number) as fly or worm. Many are aPKs. Most of them are receptor tyrosine kinases (RTKs) Nervous system Immune system Hemopoiesis 95 The discovery of new kinases: a new front for battling human diseases 96 Correlating with human diseases 160 kinases mapped to amplicons seen in tumors 80 kinases mapped to amplicons in other major illnesses Usually kinases are over-expressed in cancer and other diseases 97 Correlating with human diseases 6 kinase inhibitors have been approved till today for the use against cancer >70 other inhibitors are in clinical trial 98