Database Searches for similar sequences
Bioinformatics
[Figure: labels "Protein Families" and "Annotation"]

Genome Annotation
• Annotation in bioinformatics: function, intron-exon boundaries, regulatory sequences, repeats, gene names and protein products, etc.
• This information is obtained by searching for similar sequences in databases.
[Figure: comparing an unknown molecule with similar proteins; labels: Expression, Domains, Similar Proteins?]

Orthology/paralogy/homology
[Figure: tree of TLR, TLR1, TLR2 and TLR3 genes]
Orthologous genes are homologous (corresponding) genes in different species
Paralogous genes are homologous genes within the same species (genome)

Database Searches for similar sequences
How can we search for domains in a genome?
MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLI
LRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPP
STLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQD
YNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNV
GRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESS

How can we search for similar sequences in a database?
[Figure: Alignment, HMM, Search]

What if we don't have a multiple alignment to start with?
MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLI
LRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPP
STLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQD
YNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNV
GRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESS

Global versus local alignments
Global alignment: align the full length of both sequences (the "Needleman-Wunsch" algorithm).
[Figure: Global alignment]
Local alignment: find the best partial alignment of two sequences (the "Smith-Waterman" algorithm).
[Figure: Local alignment of Seq 1 and Seq 2]
We can search by aligning the query to every single protein sequence in genomes!

Local alignment overview
Local alignment: example
• The recursive formula is changed by adding a fourth possibility: zero.
This means local alignment scores are never negative.

\[
H(i,j) = \max
\begin{cases}
H(i-1,j-1) + S(x_i, y_j) & \text{diagonal} \\
H(i,j-1) - \text{gap-penalty} & \text{horizontal} \\
H(i-1,j) - \text{gap-penalty} & \text{vertical} \\
0
\end{cases}
\]

• Trace-back is started at the highest value rather than in the lower right corner
• Trace-back is stopped as soon as a zero is encountered

Local alignment: example

But this is too slow!
• consider the task of searching SWISS-PROT with a query sequence:
– say our query sequence is 362 amino acids long
– SWISS-PROT release 55 (18-Mar-08) contains 129,199,355 amino acids
• finding local alignments for this query via dynamic programming would entail approx. 5 x 10^10 matrix operations (362 x 129,199,355)
• many servers handle thousands of such queries a day (NCBI > 50,000)

Sequence database searching
• Local alignments: too slow for repeated database searches
• Fast methods: BLAST, PSI-BLAST

Heuristic Alignment
• Heuristic alignment: an alignment that is pretty good, but you cannot prove that it is the best.
• BLAST: uses tricks that people have come up with to make alignment and database searching fast, without losing too much quality.

BLAST
• Basic Local Alignment Search Tool
• Aim: find as many good matches as possible with reasonable speed
• Let's go through it step by step to learn how BLAST works!

Blast I: Indexing
• The program makes an index by dividing every sequence in the database into words of a defined size (W).
– For DNA sequences the default is W = 11
– For protein sequences the default is W = 3
• When we run BLAST, it creates words from our query as well

Blast II: Initial Searching
• Each word in the query sequence is compared to the database index and residue pairs are scored
– For DNA sequences a match is +1, a mismatch is –3
– For protein sequences, scores for matches and mismatches are based on a substitution matrix

Database sequence (N = 46 bases)
GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG
Words (W = 11)
GTGTCAGCTAA
TGTCAGCTAAC
GTCAGCTAACG
TCAGCTAACGG
CAGCTAACGGC
etc.
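The word indexing and word scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration, not how BLAST itself is implemented: the function names `words`, `build_index` and `word_score` are hypothetical, and the only values taken from the slides are the word size W = 11, the DNA scores (+1 for a match, –3 for a mismatch) and the example database sequence.

```python
from collections import defaultdict


def words(seq, w):
    """Return all overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_index(seq, w):
    """Map each word to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for pos, word in enumerate(words(seq, w)):
        index[word].append(pos)
    return index


def word_score(a, b, match=1, mismatch=-3):
    """Sum of per-residue scores for two equal-length DNA words."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Example database sequence as it appears in the slides.
database = "GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG"
index = build_index(database, 11)

print(words(database, 11)[:3])                    # first three database words
print(word_score("TCAGATCACGG", "TCAGCTAACGG"))   # score of one word pair
```

A dictionary from word to positions lets each query word be looked up directly instead of being compared against every position in the database, which is the point of building the index once up front.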
Query sequence (N = 27 bases)
TCATATCACGGCCCTTCGGACCTGAGG
TCATATCACGG
CATATCACGGC
ATATCACGGCC
etc.

• The score for each word pair is the sum of the scores for each pair of residues
• Matching words scoring above a threshold (T) are retained for further analysis
– DNA: T = 0, Protein: T = 11

Database sequence (N = 46 bases)
GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG
Words (W = 11): GTGTCAGCTAA, TGTCAGCTAAC, GTCAGCTAACG, TCAGCTAACGG, CAGCTAACGGC, etc.

Query sequence (N = 27 bases)
TCAGATCACGGCCCTTCGGACCTGAGG
Words: TCAGATCACGG, CAGATCACGGC, AGATCACGGCC, etc.

Alignments

TCAGATCACGG
|   | |
TGTCAGCTAAC    (3 * 1) + (8 * -3) = -21

TCAGATCACGG
       |  |
GTCAGCTAACG    (2 * 1) + (9 * -3) = -25

TCAGATCACGG
|||| | ||||
TCAGCTAACGG    (9 * 1) + (2 * -3) = 3

Blast III: Extending Hits
• extend hits in both directions (with or without allowing gaps)
• Residues are added until the incremental score drops below a threshold (S)

Alignment

TCAGATCACGG
|||| | ||||
TCAGCTAACGG    (9 * 1) + (2 * -3) = 3

Extension

TCAGATCACGGC
|||| | |||||
TCAGCTAACGGC    (10 * 1) + (2 * -3) = 4

TCAGATCACGGCC
|||| | ||||||
TCAGCTAACGGCC    (11 * 1) + (2 * -3) = 5

• Stretches of similar regions are called HSPs (high-scoring segment pairs)

BLAST Notes
• may fail to find all HSPs
– may miss seeds if T is too stringent
• 10 to 50 times faster than Smith-Waterman
• large impact:
– NCBI's BLAST server handles more than 50,000 queries a day
– most used bioinformatics program
• The T parameter is the most important for the speed and quality of the search for HSPs:
– small T: more hits to expand, more false positives
– large T: fewer hits to expand, fewer false positives

TCAGATCACGGCCCAACGGACCTGAGG
|||| | ||||||     ||    |
TCAGCTAACGGCCGTTACGATGCTAAA    (11 - 39) = -28

BLAST ‘flavors’
• blastp compares an amino acid query sequence against a protein sequence database
• blastn compares a nucleotide query sequence against a nucleotide sequence database
• blastx compares the six-frame protein translation products of a nucleotide query sequence against a protein
sequence database
• tblastn compares a protein query sequence against a nucleotide sequence database translated in six reading frames
• tblastx searches a translated nucleotide database using a translated nucleotide query

BLAST ‘flavors’ for nucleotide sequences

[Figure: BLAST results list with numbered annotations]
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e-117, meaning that a sequence with a similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.

When is a database hit significant?
• Problem:
– Even unrelated sequences can be aligned (yielding a low score) and thus can give a BLAST hit
– How do we know if a database hit is meaningful?
– When is an alignment score sufficiently high?
• Solution:
– Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences).
– Compare actual scores to the distribution of random scores.
– Is the real score much higher than you’d expect by chance?
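The "compare the real score to random scores" idea can be sketched by shuffling one of the sequences many times and checking how often a shuffled sequence scores at least as well as the real one. This is only an illustration of the principle under stated assumptions, not BLAST's method (BLAST uses precomputed distributions rather than shuffling). The scoring function follows the local alignment recurrence given earlier in these slides (maximum of diagonal, horizontal, vertical and zero); the function names, the linear gap penalty of 2, the number of shuffles and the random seed are all assumptions made for the example.

```python
import random


def sw_score(a, b, match=1, mismatch=-3, gap=2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,     # diagonal
                          curr[j - 1] - gap,   # horizontal
                          prev[j] - gap,       # vertical
                          0)                   # start a new local alignment
            best = max(best, curr[j])
        prev = curr
    return best


def empirical_p_value(query, subject, n_shuffles=200, seed=0):
    """Return (real score, fraction of shuffles scoring at least as high).

    The +1 in numerator and denominator counts the real sequence itself,
    so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    real = sw_score(query, subject)
    letters = list(subject)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        if sw_score(query, "".join(letters)) >= real:
            at_least_as_good += 1
    return real, (at_least_as_good + 1) / (n_shuffles + 1)


score, p = empirical_p_value("TCAGATCACGGCC",
                             "GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG")
print(score, p)
```

Shuffling keeps the base composition of the subject sequence, so the random scores reflect what unrelated sequences of the same composition would give. With short sequences like these the estimate is coarse; it is meant to show the comparison, not to produce a trustworthy significance value.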
• Megablast is specifically designed to efficiently find long alignments between very similar sequences and thus is the best tool to use to find the identical match to your query sequence (W = 28)
• Discontiguous MEGABLAST is better at finding nucleotide sequences similar, but not identical, to your nucleotide query. The search is focused on the first and second codon positions.

Database searching: E-values in BLAST
BLAST uses precomputed distributions to calculate the chance that the hit found could be due to random reasons:

\[
E = K \, m \, n \, e^{-\lambda S}
\]

where S is the score, \lambda is the correction for the substitution matrix, K is the correction for the search space, m is the size of the query sequence and n is the database size.

A word of caution: E-values in BLAST
BLAST tends to overestimate the significance of its matches.
E-values from BLAST are fine for identifying sure hits.
One should be careful using BLAST’s E-values to judge if a marginal hit can be trusted.
You may want to use E-values of 10^-4 to 10^-5.

Low complexity regions
Plasmodium falciparum
>SERA_PLAFG (P13823):
MKSYISLFFILCVIFNKNVIKCTGESQTGNTGGGQAGNTVGDQAGSTGGSPQGSTGASQPGSSEPSNPVSSGHSVSTVSV
SQTSTSSEKQDTIQVKSALLKDYMGLKVTGPCNENFIMFLVPHIYIDVDTEDTNIELRTTLKETNNAISFESNSGSLEKK
KYVKLPSNGTTGEQGSSTGTVRGDTEPISDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSESLPANGPDSPTVKP
PRNLQNICETGKNFKLVVYIKENTLIIKWKVYGETKDTTENNKVDVRKYLINEKETPFTSILIHAYKEHNGTNLIESKNY
ALGSDIPEKCDTLASNCFLSGNFNIEKCFQCALLVEKENKNDVCYKYLSEDIVSNFKEIKAETEDDDEDDYTEYKLTESI
DNILVKMFKTNENNDKSELIKLEEVDDSLKLELMNYCSLLKDVDTTGTLDNYGMGNEMDIFNNLKRLLIYHSEENINTLK
NKFRNAAVCLKNVDDWIVNKRGLVLPELNYDLEYFNEHLYNDKNSPEDKDNKGKGVVHVDTTLEKEDTLSYDNSDNMFCN
KEYCNRLKDENNCISNLQVEDQGNCDTSWIFASKYHLETIRCMKGYEPTKISALYVANCYKGEHKDRCDEGSSPMEFLQI
IEDYGFLPAESNYPYNYVKVGEQCPKVEDHWMNLWDNGKILHNKNEPNSLDGKGYTAYESERFHDNMDAFVKIIKTEVMN
KGSVIAYIKAENVMGYEFSGKKVQNLCGDDTADHAVNIVGYGNYVNSEGEKKSYWIVRNSWGPYWGDEGYFKVDMYGPTH
CHFNFIHSVVIFNVDLPMNNKTTKKESKIYDYYLKASPEFYHNLYFKNFNVGKKNLFSEKEDNENNKKLGNNYIIFGQDT
AGSGQSGKESNTALESAGTSNEVSERVHVYHILKHIKDGKIRMGMRKYIDTQDVNKKHSCTRSYAFNPENYEKCVNLCNV

• Do these regions help fast evolution?
• Result of recombination events?
• Replication mistakes?
They make database searches difficult!
“Small letters” denote low-complexity sequence fragments that are ignored

PSI (Position Specific Iterated) BLAST
• basic idea
– use results from a BLAST query to construct an HMM (without insertions and deletions)
– search the database with this HMM

PSI-BLAST iteration
[Figure: PSI-BLAST cycle: query sequence (Q), BLAST search, database hits, HMM built from the hits (rows A, C, D, ..., Y), HMM search, database hits, iterate]

Orthology/paralogy
[Figure: tree of TLR, TLR1 and TLR2 genes]

Operational definition of orthology
Bi-directional best hit:
• Blast gene A in genome 1 against genome 2: gene B is best hit
• Blast gene B against genome 1: if gene A is best hit → A and B are orthologous
A number of other criteria are also in use (some of which are based on phylogeny)

Impact of using PSI-BLAST
Purple sea urchin genome is available (November 2006)
Does sea urchin have Toll-like receptors?
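The bi-directional best hit criterion from the operational definition of orthology can be sketched as follows. The sketch assumes the two BLAST searches have already been run and summarised as dictionaries mapping each gene to its best hit in the other genome; the function name and the gene identifiers are hypothetical and chosen only for illustration.

```python
def bidirectional_best_hits(best_1_to_2, best_2_to_1):
    """Return (gene in genome 1, gene in genome 2) pairs that are each
    other's best hit.

    best_1_to_2: best hit in genome 2 for each gene of genome 1
    best_2_to_1: best hit in genome 1 for each gene of genome 2
    """
    pairs = []
    for gene_a, gene_b in best_1_to_2.items():
        # A and B are called orthologous only if B's best hit is A again.
        if best_2_to_1.get(gene_b) == gene_a:
            pairs.append((gene_a, gene_b))
    return sorted(pairs)


# Hypothetical best-hit tables for two genomes.
genome1_best = {"g1_TLR1": "g2_TLR1", "g1_TLR2": "g2_TLR2", "g1_X": "g2_TLR1"}
genome2_best = {"g2_TLR1": "g1_TLR1", "g2_TLR2": "g1_TLR2"}

orthologs = bidirectional_best_hits(genome1_best, genome2_best)
print(orthologs)
```

In this made-up example `g1_X` has `g2_TLR1` as its best hit, but the reverse search does not return `g1_X`, so the pair is not reported.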