Download search1

Sequence based searching Lesson 7 Based on a presentation by Irit Gat-Viks Based on presentation by Amir Mitchel, Introduction to bioinformatics course, Bioinformatics unit, Tel Aviv University. Reminder – Importance of Homology Use a sequence as a search query in order to find homologous sequences in a database. Homology – similarity between sequences that results from a common ancestor. Basic Assumption: Sequence homology → similar structure/function Why ? • Characterizing an ORF. • Finding duplicate genes in the same organism (known function, variants) • Finding homologues genes in other organisms (phylogeny, known function) Study a sequence through homologs Identity Similarity Homology Query= uniprot|Q9UP52|TFR2_HUMAN Transferrin receptor protein 2 >gi|20140567|sp|Q07891|TFR1_CRIGR (Trfr) (TfR2). Transferrin receptor protein 1 (TfR1) (TR) (TfR) Length = 757 Score = 540 bits (1392), Expect = e-152 Identities = 305/727 (41%), Positives = 412/727 (56%), Gaps = 52/727 (7%) Query: 87 LTALLIFTGAFLLGYVAF--RGSCQAC--------GDSVLVVSEDVNYEPDLDFHQGRLY 136 + ++ F F++GY+ + R + C G+S ++ E++ RLY Sbjct: 71 IAVVIFFLIGFMIGYLGYCKRTEQKDCVRLAETETGNSEIIQEENIP-------QSSRLY 123 Query: 137 WSDLQAMFLQFLGEGRLEDTIRQTSLRERVAGSAGMAALTQDIRAALSRQKLDHVWTDTH 196 W+DL+ + + L DTI+Q S R AGS L I KL VW D H Sbjct: 124 WADLKKLLSEKLDAIEFTDTIKQLSQTSREAGSQKDENLAYYIENQFRDFKLSKVWRDEH 183 Query: 197 YVGLQFPDPAHPNTLHWVDEAGKVGEQLPLEDPDVYCPYSAIGNVTGELVYAHYGRPEDL 256 YV +Q A N + ++ G + +E+P Y YS V+G+L++A++G +D Sbjct: 184 YVKIQVKGSAAQNAVTIINVNG---DSDLVENPGGYVAYSKATTVSGKLIHANFGTKKDF 240 Query: 257 QDLRAXXXXXXXXXXXXXXXXISFAQKVTNAQDFGAQGVLIYPEPADFSQDPPKPSLSSQ 316 +DL+ I+FA+KV NAQ F A GVLIY + F P + ++ Sbjct: 241 EDLK---YPVNGSLVIVRAGKITFAEKVANAQSFNAIGVLIYMDQTKF------PVVEAE 291 Query: 317 QAVYGHVHLGTGDPYTPGFPSFNQTQFPPVASSGLPSIPAQPISADIASRLLRKLKGPVA 376 +++GH HLGTGDPYTPGFPSFN TQFPP SSGLPSIP Q IS A +L + ++ Sbjct: 292 LSLFGHAHLGTGDPYTPGFPSFNHTQFPPSQSSGLPSIPVQTISRKAAEKLFQNMETNCP 351 • For Proteins, finding distant relatives is a difficult task. • Distant protein family members, may share <20% amino acid identity(!). >gi|3582021|emb|CAA70575.1| cytochrome P450 [Nepeta racemosa] Length = 509 Score = 405 bits (1043), Expect = e-111 Identities = 94/479 (19%), Positives = 192/479 (40%), Gaps = 35/479 (7%) Query: 61 NLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVIDPEDVALLFKSEGPNPER NL+ G + H + ++YGP+ + G+V + PE + K++ Sbjct: 45 NLHQL----GLY-PHRYLQSLSRRYGPLMQLHFGSVPVLVASSPEAAREIMKNQDIVFSN ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Query: 297 -----DYRGMLYRLLGDSK----MSFEDIKANVTEMLAGGVDTTSMTLQWHLYEMARNLK D+ +L + ++K + + +KA + +M G DTT+ L+W + E+ +N + Sbjct: 271 GDGALDFVDILLQFQRENKNRSPVEDDTVKALILDMFVAGTDTTATALEWAVAELIKNPR 120 99 347 330 Query: 348 VQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLH-PISVTLQRYLVNDLVLRD 406 L+ EV L+ +P LKASIKE+LRLH P+ + + R D + Sbjct: 331 AMKRLQNEVREVAGSKAEIEEEDLEKMPYLKASIKESLRLHVPVVLLVPRESTRDTNVLG 390 Query: 407 YMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSK--DKNITYFRNLGFGWGVRQCLG 464 Y I + T V + +A+ R+P+ + +PE F P R+L D +F L FG G R C G Sbjct: 391 YDIASGTRVLINAWAIARDPSVWENPEEFLPERFLDSSIDYKGLHFELLPFGAGRRGCPG 450 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Query types DNA vs. Protein (e.g., UCAUAC or Serine –Tyrosine) • The sequence query can be a nucleotide sequence or an amino acid sequence. • The search is preformed against a nucleotide or amino acid database Which search is preferable? 1. Which sequence is more conserved during evolution? Answer: The genetic code is redundant. Some amino acids are coded by more than one codon. Therefore, the DNA sequence can change while the amino acid sequence will remain the same. 2. Nucleotides: a four letter alphabet. Amino acids: a twenty letter alphabet. Two random DNA sequences will share on average 25% of identity. Two random protein sequences will share on average 5% of identity. 3. Protein comparison matrices are much more sensitive than those for DNA, i.e., similarity relationships are defined between two amino acids (PAM/Blosum). 4. DNA databases are much larger, meaning more random hits. Using the amino acid sequence is preferable for homology search. 1. Protein sequence comparisons typically double the evolutionary look-back time over DNA sequence comparisons. 2. Evolutionary distant proteins will exhibit a high similarity rather than a high identity. 3. Hits can exhibit a long alignment (homology) or a short alignment (conserved domains). Why use a nucleotide sequence after all? Query type • • The sequence query can be a nucleotide sequence or an amino acid sequence. But … we can translate the query sequence! The search is performed against a nucleotide or amino acid database. But … we can use translated databases! (e.g., trEMBL) All types of searches are possible. • Query: DNA Protein Database: DNA Protein Nucleotide query can be translated and searched against protein databases: 1. Translate all reading frames (3 + 3) 2. Find long ORF. • Amino acid query can be back-translated to and searched against nucleotide databases? 1. During translation we lose information. 2. A single amino acid sequence can be back-translated to many possible nucleotide sequences . Query types 1. amino acid query against protein database (blastp) – identifying a protein sequence – finding similar sequences in protein databases. 2. nucleotide query against nucleotide database (blastn) – In non-coding regions (no ORF found)- Identify the query sequence or find similar sequences. – Find primer binding sites or map short contiguous motifs 3. compares translated nucleotide query against protein database. (blastx) – Useful when the query include a coding region, and we try to find homologous proteins. – Used extensively in analyzing EST sequences. This search is more sensitive than nucleotide blast since the comparison is performed at the protein level. 4. protein query against translated nucleotide database (tblastn) – useful for finding protein homologs in unnannotated nucleotide data of coding regions (e.g., ESTs, draft genome records (HTG)). 5. translated nucleotide query against translated nucleotide database. (tblastz) – Useful for identifying novel genes in error prone query sequences. – Used for identifying potential proteins encoded by single pass read ESTs. * six-frame in all translations!!! Searching databases • Naïve solution: Use exact algorithm to compare each sequence in the database to the sequence query. • Problems: 1. 2. • Solutions: 1. 2. • Databases are huge, millions of sequences. Running the computations in parallel is expensive. Use a heuristic algorithm to discard most irrelevant sequences. Perform the alignment on the small group of remaining sequences. Key concept of BLAST (Basic Local Alignment and Search Tool): Homologous sequences are expected to contain ungapped short segments (with substitutions, but without gaps) Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. BLAST • Preprocess – Low complexity regions are removed – A dictionary for K-tuple words is prepared for the query sequence and the database. Protein 3 letter words, DNA 4-6 or even 11 letter words. • Searches for K-tuple words and find database records with common words. Words can be similar, not only identical. – – – – Identity - CAT : CAT Similarity – CAT : CAT, CAR, HAT … But even CAT : ZTX can be similar For each three letter word there are at most 203 similar words. – Similar words are only the ones that have a minimum cut-off score (T). BLAST Stage I • Find matching word pairs • Extend word pairs as much as possible, i.e., as long as the total weight increases • Result: High-scoring Segment Pairs (HSPs) THEFIRSTLINIHAVEADREAMESIRPATRICKREAD INVIEIAMDEADMEATTNAMHEWASNINETEEN BLAST Stage II • Try to connect HSPs by aligning the sequences in between them: THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD INVIEIAMDEADMEATTNAMHEW___ASNINETEEN BLAST Versions of the program [t]BLAST[x/n/p] t : Translate a DNA database in all 6 reading frames for comparison with a Protein query. x : Translate a nucleotide query in all 6 reading frames for comparison with a Protein database. p : Comparison is against a Protein database. n : Comparison is against a Nucleotide database. Masking low complexity • There is one frequent case where the random models and therefore the statistics discussed here break down: regions with highly biased amino acid composition ("low complexity" regions). • Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due to segment composition. • Usually generated by slippage and thus not interesting. The BLAST programs employ the SEG (protein) DUST(DNA) algorithm to filter low complexity regions from proteins before executing a database search. •Masking is practiced on the query sequence only, not on the database sequences! BLAST http://www.ncbi.nlm.nih.gov/ BLAST http://www.ncbi.nlm.nih.gov/BLAST/ BLAST …or even construct my own BLAST searchable database by an Entrez query Mask According to the case within the query sequence. I can limit my search to a selected organism Mask for lookup table hit search stage, but NOT for the hit extension stage. Filter Low Complexity regions by SEG or DUST BLAST Lineage Report root . Bilateria [animals] . . Coelomata [animals] . . . Euteleostomi [vertebrates] . . . . Tetrapoda [vertebrates] . . . . . Eutheria [mammals] . . [mammals] . . . Homo sapiens (man) acid -----------18 hitsacid [mammals] retino 18 .hits retinoic induced 3;571 retinoic responsive gene [H . . [mammals] . . . Mus musculus (mouse) .......... 15 hits [mammals] retino 15 .hits retinoic acid inducible 432 protein 3 [Mus musculus] . . [mammals] . . . Rattus norvegicus (brown rat) acid . 411 5 hits protein [mammals] 5 .hits similar to retinoic inducible 3 [Rattussimila norv . . [amphibians] . . Xenopus laevis (clawed frog) [Xenopus ---- 216laevis] 1 hit [amphibians] MGC687 1 .hit MGC68729 protein . . [bony . Takifugu rubripes (torafugu) -----40 4rubripes] hits [bony fishes] pherom 4 .hits fishes] pheromone receptor [Takifugu . . [flies] Drosophila melanogaster ------------48 4 hits [flies] CG8285 4 .hits CG8285-PA [Drosophila melanogaster] >gi|2827758|sp|P22815 . . [flies] Drosophila virilis .................. 39 precursor 1 hit [flies] Bride 1 .hit Bride of sevenless protein >gi|1079166|pir||A47 . . [flies] Anopheles gambiae str. PEST ......... 38 1 gambiae] hit [flies] ENSANG 1 .hit ENSANGP00000013404 [Anopheles >gi|21296536|gb|EA . Caenorhabditis ---------------2 hitsto[nematodes] calciu 2 .hits [nematodes] elegans calcium-sensing receptor,41similar human metabotropic g environmental sequence -----------------40 2 hits [unclassified] unknow 2 .hits [unclassified] unknown [environmental sequence] BLAST BLAST BLAST The statistics of sequence similarity scores • Bits score – A score for the alignments according to the number of similarities, identities, etc. • Expected-score (E-value) (of an alignment having a score S): The number of times one expects to find alignments with a score >= S of a random sequence Vs. a random database. (having the same lengths and compositions). The closer the e-value approaches zero, the greater the confidence that the match is real (from zero to one). BLAST What about: •Short sequences? •large sequences and queries? Short sequences: Parameter settings for standard blastp and "Search for short and nearly exact matches" Word Size SEG Filter Standard Blastp 3 On 10 BLOSUM62 Search for short and nearly exact matches 2 Off 20000 PAM30 Program Expect Score Matrix Value Parameter settings for standard blastn and "Search for short and nearly exact matches" Word Size DUST Filter Expect Value Standard blastn 11 On 10 Search for short nearly exact matches 7 Off 1000 Program PAM vs. BLUSOM- reminder – Different BLOSUM matrices are derived from blocks with different identity percentage. (e.g., blosum62 is derived from an alignment of sequences that share at least 62% identity.) Larger n  smaller evolutionary distance. – Single PAM was constructed from at least 85% identity dataset. Different PAM matrices were computationally derived from it. Larger n  larger evolutionary distance Observed % Difference 1 10 20 30 40 50 60 70 80 Evolutionary distance (PAM) 1 11 23 38 56 80 120 120 159 250 250 BLOSUM 99 90 80 70 60 50 40 30 20 62 How to generate results from large sequences and queries? 1. Some sequences contain large regions of ALU repeats. In this case you can select the "Human Repeat" filtering option on the main BLAST search page. This will mask repeat regions which generate a large number of biologically uninteresting hits to the databases. 2. Increase the Word Size to 20 - 25. With a default Word Size of 7, limiting the number small initial fragments to be extended to HSPs. 3. Decrease the Expect value to 1.0 or lower  eliminates many hits and concentrate on results which are more likely to contain large coding regions and genomic fragments. 4. Processing multiple query sequences in one run can be much faster than processing them with separate runs because the database is scanned only 1 time for the entire set of queries. PSI-BLAST (Position-Specific Iterated (PSI)-BLAST ) Sensitive protein-protein similarity searches. • The most sensitive BLAST program, making it useful for finding very distantly related proteins. • Use PSI-BLAST when your standard protein-protein BLAST search failed to find significant hits. Algorithm: • The first round of PSI-BLAST is a standard protein-protein BLAST search. The program builds a position-specific scoring matrix (PSSM or profile) from an alignment of the sequences returned with Expect values better (lower) than the inclusion threshold (default=0.005). • The PSSM will be used to evaluate the alignment in the next iteration of search. Any new database hits below the inclusion finding very distantly related proteins.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download search1