Dotplot: A protein built from modules that resemble each other

The choice of score values (substitution matrix) matters

- Scoring matrices appear in every analysis that involves sequence comparison.
- The choice of matrix can strongly influence the outcome of the analysis.
- Scoring matrices implicitly represent a particular theory of evolution.
- Understanding the theory underlying a given scoring matrix helps in making a proper choice.

Different principles for substitution matrices

- Identity matrix.
- Genetic code matrix: the score is based on the minimum number of base changes required to convert one amino acid into another.
- Physical/chemical characteristics: an attempt to quantify some physical or chemical attribute of the residues and to assign weights, somewhat arbitrarily, based on the similarities of the residues.
- Log odds matrices, e.g. PAM250 and BLOSUM62.

Log odds matrices

S_ij = log( q_ij / (p_i * p_j) )

S is the log odds ratio of two probabilities: the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.
- q_ij is the frequency with which residues i and j are observed to align in sequences known to be related. These frequencies are derived from a "transition probability matrix."
- p_i and p_j are the frequencies of occurrence of residues i and j in the set of sequences.

PAM matrices: how did Margaret Dayhoff construct them?

1. Align sequences that are at least 85% identical (this minimizes ambiguity in the alignments and the number of coincident mutations).
2. Reconstruct phylogenetic trees and infer ancestral sequences. 71 trees containing 1,572 exchanges were used.
3. Count the replacements "accepted" by natural selection in all pairwise comparisons (each A_ij is the number of times amino acid j was replaced by amino acid i in all comparisons).
4. Compute the amino acid mutability m_j, i.e. the propensity of a given amino acid j to be replaced.
5. Combine the data from steps 3 and 4 to produce a Mutation Probability Matrix for one PAM of evolutionary distance (1 PAM = one accepted point mutation per 100 residues), according to the following formulae:
   M_ij = lambda * m_j * A_ij / sum_k(A_kj)   for i != j
   M_jj = 1 - lambda * m_j
   where lambda is a proportionality constant chosen so that the matrix corresponds to a distance of 1 PAM.
6. Calculate the Log Odds Matrix for similarity scoring. Divide each element of the Mutation Probability Matrix, M, by the frequency of occurrence of each residue:
   R_ij = M_ij / f_i
   R is a Relatedness Odds Matrix and f_i is the frequency of residue i. The Log Odds Matrix, S_ij, is calculated from the relatedness odds matrix simply by taking the logarithm of each R_ij and multiplying by 10:
   S_ij = 10 * log10(R_ij)

PAM250 substitution matrix

Limitations of the PAM model

Assumptions in the PAM model:
1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
2. The sequences being compared have average amino acid composition.

Sources of error in the PAM model:
1. Many sequences depart from average composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
4. The Markov process is an imperfect representation of evolution. Distantly related sequences usually have islands (blocks) of conserved residues, which implies that replacement is not equally probable over the entire sequence.

BLOSUM (Blocks Substitution Matrix) substitution matrices

1. The starting data are conserved blocks from the Blocks database.
   - Aligned, ungapped sequences.
   - Widely varying similarity, but measures are taken to avoid biasing the sample with frequently occurring, highly related sequences.
2. Replacements are counted by straightforward counting of all pairs of aligned residues, f_ij.
   - The observed frequency of each pair is q_ij = f_ij / (total number of residue pairs).
   - This includes cases of i = j (i.e. no replacement observed).
   - The expected frequency of each pair is essentially the product of the frequencies of each residue in the data set.
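The pair counting and log odds calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build the published matrices: the block `toy_block` is invented, no sequence weighting is applied, unordered pairs with i != j get an expected frequency of 2 * p_i * p_j, and the scores are scaled as 2 * log2 (half-bit units). The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_frequencies(block):
    """Observed frequency q_ij of each unordered residue pair in an ungapped block."""
    counts = Counter()
    for column in zip(*block):               # walk the block column by column
        for a, b in combinations(column, 2):  # every pair of sequences in the column
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds(q):
    """Log odds scores from pair frequencies (assumed scaling: 2 * log2)."""
    # Marginal residue frequency: p_i = q_ii + sum over j != i of q_ij / 2
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        # Assumption: unordered pairs, so i != j is expected at 2 * p_i * p_j
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * log2(freq / expected)
    return scores


# Hypothetical toy block: four aligned, ungapped sequences of length three
toy_block = ["AAC", "AAC", "ASC", "SAC"]
q = pair_frequencies(toy_block)
scores = log_odds(q)
```

Pairs that are never observed in the block (here, S aligned with S) simply get no score in this sketch; a real matrix needs far more data so that every pair is observed.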
BLOSUM (Blocks Substitution Matrix) substitution matrices, continued

3. Similar sequences in a block above a threshold percent similarity are clustered, and the members of a cluster count fractionally toward the final tally.
   - This reduces the number of identical pairs (AA, SS, TT, etc. matches) in the final tallies.
   - It is somewhat analogous to increasing the PAM distance.
   - If the clustering threshold is 80%, the final matrix is BLOSUM80.
   - Clustering at 62% reduces the number of blocks contributing to the table by 25%; 1.25 x 10^6 pairs still contributed.
   - The least frequent amino acid pair replacement was observed 2369 times.

BLOSUM62

BLOSUM and PAM: a comparison

FASTA and BLAST: searching the databases for related sequences

Searching the databases with a rigorous Smith-Waterman algorithm is resource-intensive (but possible). FASTA and BLAST give faster searches that use fewer resources by taking shortcuts. Both pre-screen the sequences in the database so that only sequences that look interesting (that appear to resemble the query sequence) are processed further.

How FASTA works

Query sequence s:
position:  1  2  3  4  5  6  7  8  9 10 11
residue:   H  A  R  F  Y  A  A  Q  I  V  L

Hash table (ktup = 1), giving the positions of each residue in s:
A: 2, 6, 7
F: 4
H: 1
I: 9
L: 11
Q: 8
R: 3
V: 10
Y: 5
others: no entry

Database sequence t, with the offsets (position in s minus position in t) of each match:
position  residue  offsets
1         V        +9
2         D
3         M
4         A        -2, +2, +3
5         A        -3, +1, +2
6         Q        +2
7         I        +2
8         A        -6, -2, -1

Offset vector (number of matches per offset, over the range -7 to +10): -6: 1, -3: 1, -2: 2, -1: 1, +1: 1, +2: 4, +3: 1, +9: 1; all other offsets are 0.

From: G. J. Barton, "Protein Sequence Alignment and Database Scanning", in Protein Structure Prediction: A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996.

FASTA, continued

FASTA then joins two or more k-tuples if they do not lie too far apart; together they form a region, which can be seen as a local alignment without gaps. The 5 best regions from the previous phase are then rescored with PAM120 or PAM250. This is the first measure of similarity between the query sequence and the database sequence, and it is called the initial score in the result file. One such score is computed for every sequence in the database.
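The lookup step in the hash table example (query s, database sequence t, ktup = 1) can be sketched in Python as follows. This is a minimal illustration of the idea, not the FASTA implementation; the function names are hypothetical, and the joining and rescoring of regions is left out.

```python
from collections import Counter, defaultdict


def build_hash_table(query, ktup=1):
    """Map every k-tuple of the query to its 1-based start positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i + 1)
    return table


def offset_vector(query, target, ktup=1):
    """Count k-tuple matches per offset (position in query minus position in target)."""
    table = build_hash_table(query, ktup)
    offsets = Counter()
    for j in range(len(target) - ktup + 1):
        # .get() avoids adding empty entries for k-tuples absent from the query
        for i in table.get(target[j:j + ktup], []):
            offsets[i - (j + 1)] += 1
    return offsets


s = "HARFYAAQIVL"   # query sequence from the example
t = "VDMAAQIA"      # database sequence from the example
offsets = offset_vector(s, t)
best_offset, hits = offsets.most_common(1)[0]
```

An offset that collects many matches corresponds to a diagonal in the dotplot of s against t, which is why the best-populated offsets are the natural starting points for building regions.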
The optimized score is then computed in the manner of Smith-Waterman, but restricted to cells in a band around the initial alignment.

FASTA: choice of k-tuple value

For DNA searches ktup is 4-6; for protein searches it is 1 or 2. The choice of ktup affects the result:
- A low ktup increases sensitivity, i.e. the ability to find distant relatives.
- A high ktup increases selectivity, i.e. the ability to reject false positives.

Variants of FASTA

PROGRAM      FUNCTION
fasta3       scans a protein or DNA sequence library for similar sequences
fastx/y3     compares a DNA sequence to a protein sequence database, comparing the translated DNA sequence in forward and reverse frames
tfastx/y3    compares a protein to a translated DNA data bank
fasts3       compares linked peptides to a protein databank
fastf3       compares mixed peptides to a protein databank

FASTA results

Parameters that indicate how good our database hits are:
- init1: score of the highest-scoring initial region.
- initn: sum of the initial scores of the joined regions, minus a joining penalty for each gap.
- opt: score of the optimal alignment of the region.
- Z: a measure of how unusual the original match is. If the score is S, Z = (S - mean) / sd.
- P: the probability that the alignment is no better than random.
- E(n): the expected number of sequences giving the same Z-score or better if the database is probed with a random sequence. E = P * (database size n).

Evaluating the results

- Z-score > 5: significant.
- P < 10^-100: exact match.
  10^-100 < P < 10^-50: nearly identical sequences.
  10^-50 < P < 10^-10: closely related, certain homology.
  10^-5 < P < 10^-1: usually distant relatives.
  P > 10^-1: probably not a significant hit.
- E < 0.02: probably homologous sequences.
  0.02 < E < 1: homology cannot be ruled out.
  E > 1: a chance match?

How BLAST works (Basic Local Alignment Search Tool)

- BLAST makes a list of all three-letter words (subsequences) in the query protein (for the sequence MEFGALLY... these are MEF, EFG, FGA, GAL, and so on).
- Using BLOSUM62, BLAST identifies for each of these words the words that score above a certain threshold (the neighborhood word score threshold); this gives about 50 new words for each original word.
- Each sequence in the database is then scanned for exact matches to each of the 50 words for every position in the query sequence.
- The hits are then extended until the score begins to drop. The result is a longer aligned stretch of sequence called an HSP (high-scoring segment pair).
- HSPs with suitable placement are joined.

From: G. J. Barton, "Protein Sequence Alignment and Database Scanning", in Protein Structure Prediction: A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996.

BLAST results

BLAST results, continued

Variants of BLAST

- blastp compares an amino acid query sequence against a protein sequence database.
- blastn compares a nucleotide query sequence against a nucleotide sequence database.
- blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
- tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Note that tblastx is extremely slow and CPU-intensive.
- PSI-BLAST (Position-Specific Iterated BLAST) uses an iterative search in which the sequences found in one round of searching are used to build a score model for the next round. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search, and the results of each "iteration" are used to refine the profile. This iterative strategy results in increased sensitivity.

The human genome

Horizontal gene transfer? Probable vertebrate-specific acquisition of bacterial genes

But no...
But no, continued

Phylogenetic analysis

What went wrong?

"A different methodological reason for several of the genes in the human genome report being considered as bacteria-vertebrate HGTs, was that phylogenetics was not the analytical approach, and that the conclusions were instead derived largely from top BLAST hit results. In several instances the top BLAST hit was indeed a bacterial species, whereas further down the list of significant BLAST hits one finds a nonvertebrate eukaryote. When such sequences were properly aligned, the resulting phylogenetic trees often supported the monophyly of eukaryotes with the nonvertebrate eukaryote at the base."

ClustalW alignment

                     *        20         *        40         *        60
Human     : ------------------------------------------------------------- :
Termotoga : --------------------------------------MMSGHNKWANIKHRKMAQDAKKS :  23
C.elegans : MFSPLRRLTTTGLQLQKLQKLQKLQQFQPARAVHLTVFQQKGHSKWQNIKAVKGKNDLIRS :  61
                                                     gh kw nik  k   d   s

                    *        80         *       100         *       120
Human     : ------------------------------------------------------------- :
Termotoga : KIFTKLIREIIVAAREGGGNIETNPRLRAAVERARAENMPKENIERAIKRGTGELEGVDYQ :  84
C.elegans : KATNFLLRKVRGAVSRGGFDMKLNRELADLESEFRAQGLPLDTLKNFLQKMKDKPE----V : 118
            k    l r    a   gg     n  l       ra   p               e

                   *       140         *       160         *       180
Human     : ------------------MNKNGGVMAVGARHSFDKKG-VIVVEVEDR-----EKKAVNLE :  37
Termotoga : EVIYEGYAPGGVAVYIRALTDNKNRTAQELRHLFNKYG-GSLAESGSVSWIFERKGVIEIS : 144
C.elegans : EYSFDIIGPSGIFLIVTAETSNKKAFENDLRKYFNKLGGFRLAADGGVRSWFEEKGVVHVD : 179
            e       p g      a t Nk   a  lRh F1K G   6ae g v   feeKgv6 6

                  *       200         *       220         *       240
Human     : R---ALEMAIEAGAEDVKETEDEEER-------NVFKFICDASSLHQVRKKLDSLGLCSVS :  88
Termotoga : R---DKVKDLEELMMIAIDAGAEDIKDAE----DPIQIITAPENLSEVKSKLEEAG-YEVE : 197
C.elegans : TKKGGKILNIEEMEEIGLEFDAEEVLLIEEDSTKKFELICDAKSLQTLENGLGKGGFSILQ : 240
            r    k   6Ee  ei  e  aEe    e      f  Icda sL  6  kL   G   6

                 *       260         *       280         *       300
Human     : CALEFIPNSKVQLAEPDLEQAAHLIQALSNHEDVIHVYDNIE--------------- : 130
Termotoga : AKVTFIPKNTVKVTGKDAEKVLEFLNALEDMDDVQEVYSNFEMDDKEMEEILSRLEG : 254
C.elegans : SEIEFRPVHPIDCPEAEEPKVQKLYEMLQEDEQVRQIFDNITPDE------------ : 285
              6eFiP   6   e d ekv  l  aL   edV  65dNie d

The conclusion

"Most of our analyses and phylogenetic topologies are highly consistent with the view that vertebrates and bacteria share these loci through common ancestry, involving a succession of non-vertebrate eukaryote intermediates. A further point arising from our analysis is that the evolutionary relationships among proteins cannot be concluded solely from the ranking of database hits in homology searches (for example, BLAST reports). This is not a new conceptual point (see refs 7, 12, 13), but one that seems to have been overlooked in this instance. Phylogenetic analysis must be a central component of any protein family or genome annotation effort. Importantly, phylogenetic reconstruction is critical to synthesizing, from the growing wealth of sequence data, a more
