Introduction to Bioinformatics
Tutorial no. 2
BLAST

BLAST – Outline
• Sequence Alignment
• Complexity and indexing
• BLASTN and BLASTP
• Basic parameters
• PAM and BLOSUM matrices
• Affine gap model
• E Values (once again)
• Advanced BLAST
• Databases
• BLAST options
• BLAST output
• Taxonomic BLAST
• Pairwise BLAST

BLAST Variations
Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic
Genomic translations test all 6 possibilities: 3x for codon frames, 2x for reverse complement.

BLASTN Databases
nr       GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs     High-throughput genomic sequences (draft)
pat      Patented nucleotide sequences
mito     Mitochondrial sequences
vector   Vector subset of GenBank
month    GenBank, EMBL, DDBJ, PDB from 30 days
chrom    Contigs and chromosomes from RefSeq

BLASTP Databases
nr          GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot   SWISS-PROT
pat         Patented protein sequences
pdb         Protein Data Bank
month       GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from 30 days

BLASTN/P Options (1)
• Only search part of database using NCBI Entrez query format
• Search specific organism
• Remove low information content, e.g. short repeats or rich in only 2 nucleotides
• Remove known human repeats (LINEs, SINEs)

BLASTN/P Options (2)
• Threshold for results significance
• Use index based on words of 7, 11 or 15 nucleotides
• Costs to open and extend gap, score for nucleotide match or mismatch.
• Allowed gap scores: 10/1, 10/2, 11/1, 8/2, 9/2

BLASTP Options
• Scoring matrix: PAM, etc…
• Costs to open and extend gap
• Search for a motif (PSI-BLAST)

BLASTN/P Formatting (1)
• Show colored bar chart
• Other (less important) options on what to show
• Number of sequences listed
• Number of alignments shown

BLASTN/P Formatting (2)
• How to display alignments
• Only show results which match Entrez search or are from specific organism
• Only show results with E values in this range

BLASTN Results
• Query sequence representation
• Matched areas of database sequences

BLAST Output Header
• Request ID for later retrieval
• Query sequence details
• Database details
• Tax BLAST

BLAST Alignments (1)
• Sequence identifier
• Sequence description
• Score and E value

BLAST Alignments (2)
• Normalized score of alignment
• Expected number of such hits (2e-11 = 2 × 10^-11)
• Number of insertions / deletions
• Number of exact matches
• Number of matches with positive score

BLAST Alignments (3)
• Insertion / deletion
• Exact match
• Query sequence
• Mismatch with positive score
• Matched sequence
• Position within sequence
• Masked low complexity region

Expectation Values
• Increases linearly with length of query sequence
• Increases linearly with length of database
• Decreases exponentially with score of alignment

Tax BLAST
• Lineage of organism with strongest hit
• Shared ancestry in taxonomic tree
• Score of organism’s strongest hit
• Number of organism hits

BLAST2SEQ
• Type of program
• This tool produces the alignment of two given sequences using the BLAST engine for local alignment.
• Scoring matrix
• Scoring scheme
• Gap model, Expect Value, Advanced options
• GO !
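The three properties listed under Expectation Values above match the standard Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length and S the raw alignment score. The sketch below is only an illustration of that relationship: the function names are made up for this example, the values of K and λ depend on the scoring system and must be supplied by the caller, and real BLAST additionally applies effective-length corrections that are ignored here.

```python
import math


def evalue_from_raw_score(score, query_len, db_len, lam, k):
    """E = K * m * n * exp(-lambda * S).

    lam and k are statistical parameters of the scoring system;
    they are inputs here, not values taken from the tutorial.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def evalue_from_bit_score(bit_score, query_len, db_len):
    """Same relationship for a normalized (bit) score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Placeholder parameters chosen only to show the trends.
    lam, k = 0.3, 0.1
    base = evalue_from_raw_score(50, 500, 1_000_000, lam, k)
    longer_query = evalue_from_raw_score(50, 1000, 1_000_000, lam, k)
    higher_score = evalue_from_raw_score(60, 500, 1_000_000, lam, k)
    print(f"baseline E:          {base:.3g}")
    print(f"query twice as long: {longer_query:.3g}")
    print(f"score 10 higher:     {higher_score:.3g}")
```

Doubling the query or database length doubles E, while each additional unit of score multiplies E by e^(-λ), which is why high-scoring hits get vanishingly small E values.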
Questions
You have two query sequences, query1 and query2:

>query1
CCGTCCGTCCGTCGTCCTCCTCGCTTGCGGGGCGCCGGGCCCGTCCTCGAGCCCCCNNNNNCCGTCCGGC
CGCGTCGGGGCCTCGCCGCGCTCTACCTACCTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAA
AGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACTGCGAATGGCTCATTAAATCAGT
TATGGTTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCC
GACGGGCGCTGACCCCCTTCGCGGGGGGGATGCGTGCATTTATCAGATCAAAACCAACCCGGTCAGCCCC
TCTCCGGCCCCGGCCGGGGGGCGGGCCGCGGCGGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCAC
GCCCCCCGTGGCGGCGACGACCCATTCGAACGTCTGCCCTATCAACTTTCGATGGTAGTCGCCGTGCCTA
CCATGGTGACCACGGGTGACGGGGAATCAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCAC
ATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGACCCGGGGAGGTAGTGACGAAAAATAACAA
TACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTCCACTTTAAATCCTTTAACGAGGATCCATTGGA
GGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGCTGCAGTTAA
AAAGCTCGTAGTTGGATCTTGGGAGCGGGCGGGCGGTCCGCCGCGAGGCGAGCCACCGCCCGTCCCCGCC
CCTTGCCTCTCGGCGCCCCCTCGATGCTCTTAGCTGAGTGTCCCGCGGGGCCCGAAGCGTTTACTTTGAA
AAAATTAGAGTGTTCAAAGCAGGCCCGAGCCGCCTGGATACCGCAGCTAGGAATAATGGAATAGGACCGC
GGTTCTATTTTGTTGGTTTTCGGAACTGAGGCCATGATTAAGAGGGACGGCCGGGGGCATTCGTATTGCG
CCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACGGACCAGAGCGAAAGCATTTGCCAAGAATGTTTTC
ATTAATCAAGAACGAAAGTCGGAGGTTCGAAGACGATCAGATACCGTCGTAGTTCCGACCATAAACGATG
CCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGG
TTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAG
CCTGCGGCTTAATTTGACTCAACACGGGAAACCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGA
TAGCTCTTTCTCGATTCCGTGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTA
ATTCCGATAACGAACGAGACTCTGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAA
CTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGTCTGTGATGCCCTTAGAT
GTCCGGGGCTGCACGCGCGCTACACTGACTGGCTCAGCGTGTGCCTACCCTACGCCGGCAGGCGCGGGTA
ACCCGTTGAACCCCATTCGTGATGGGGATCGGGGATTGCAATTATTCCCCATGAACGAGGAATTCCCAGT
AAGTGCGGGTCATAAGCTTGCGTTGATTAAGTCCCTGCCCTTTGTACACACCGCCCGTCGCTACTACCGA
TTGGATGGTTTAGTGAGGCCCTCGGATCGGCCCCGCCGGGGTCGGCCCACGGCCTGGCGGAGCGCTGAGA
AGACGGTCGAA

>query2
TACGAACGCTGGCGGCATGCTAATACATGCAAGTCGAACGAGACCTTCGGGTCTAGTGGCGCACGGGTGG
CTAACGCGTGGGAATCTGCCCTTGGGTTCGGAATAACTTCGGGAAACTGAAGCTAATACCGGATGATGAC
GAAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGGGGTAAAGGCTCA
CCAAGGCAACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACT
CCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTG
ATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGAGATGATAATGACAGTATCGGGAGAATAAGCTCCG
GCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAG
CGCACGTAGGCGGCGATTTAAGTCAGAGGTGAAAGCCCGGGCTCAACCCCGAACTGCCTTTGAGACTGGA
TTGCTAGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAAC
ACCAGTGCGAAGGCGGCTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGAT
TAGATACCCTGGTAGTCCACGCCGTAAACGATGATAACTAGCTGCCGGGGCACATGGTGTTTCGGTGGCG
CACGTAACGCATTAAGTTATCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGG
GCCTGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCAGCGTTTGACATC
CTCATCGCGGATTTCAGAGATGATTTCCTTCAGTTCGGCTGGATGAGTGACAGGTGCTGCATGGCTGTCG
TCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTTAGTTGCCAGCAT
TTAGTTGGGTACTCTAAAGGAACCGCCGGTGATAAGCCGGAGAAGGTGGGGATGACGTCAAGTCCTCATG
GCCCTTACGCGCTGGGCTACACACGTGCTACAATGGCGACTACAGTGGGCTGCAACCGTGCGAGCGGTAG
CTAATCTCCAAAAGTCGTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGGCGGAATCGCTAG
TAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCAGGCCTTGTACACACCGCCCGTCACACCATGGG
ATTTGGATTCACCCGAAGGCACTGCGTTAACCCGCAAGGGAGACAGGTGACCACGGTGGGTTTAGAGACT
GGGGTGAA

Using BLASTN
• Find out what each of these sequences codes for.
• To which organism is each sequence related?
• Do these sequences code for proteins?
• Suppose the information used to answer the previous questions were not available to you. Could you suggest a way to answer these questions anyway?
• Answer: BLASTX
• Look carefully at the e-value column of the first 50 results of each query. What can you learn about these sequences?
  Are these sequences generally conserved across other organisms?
5 last answers

Questions
• Use bl2seq to align the two query sequences. What can you say about the relation between them? Does this last result make sense?

Questions
You have two query sequences:

>query3
ATGTCTGCTCCACAAGCCAAGATTTTGTCTCAAGCTCCAACTGAATTGGAATTACAAGTT
GCTCAAGCTTTCGTTGAATTGGAAAATTCTTCTCCAGAATTGAAAGCTGAGTTGAGACCT
TTGCAATTCAAGTCCATCAGAGAAGT

>query4
GTATGTTATTAATTTGAATCTAAACTTAAGAATAATGGAGAGTAACAAAGGAAAAAAGTG
TGAACGGGACGATACCAGAATGTTTCAATCTAGAAAAGTATAAAAGATAAGGACTAGGAC
TCAAATGTATTTGGCTGACTATCGCCTGAACCTTGATGCTAAGCAAATACCATATCTTCA
AGAAAAAGCCTACTCCAGTGTTTAAGAAGAAGGGAACGATTTACTAGATCATGCTATACG
CAGTAAGGTTCTGATAGTTAATTACAATCGGTCCAAGTTCTAAGCGGTGTCGTCCATGCA
TATATCATTTACAAGTTACTGGCGTCAACTCTTCAAATATTCAAAATATCACCTAATCAA
ACTTACTAACATTTTCCTTTTTTGTTTTCCTTCTTTTATAG

Now use BLASTX
• What protein do these sequences code for?
• Are these proteins conserved in other organisms?

Query   Result
3       A conserved protein component of the small (40S) ribosomal subunit of S. cerevisiae.
4       No protein (e-value 3.2).

Questions
• You are told that the sequences were extracted from the same gene. How could you explain the above results?
• Answer: query4 is extracted from a non-coding region (intron) and thus doesn’t code for any protein.
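
BLASTX compares a nucleotide query against a protein database by translating the query in all six reading frames (three codon offsets on each strand), as noted in the BLAST Variations table. The following is a minimal sketch of that translation step only, not of BLASTX itself: the function names are invented for this example, it assumes the standard genetic code, and codons containing ambiguous bases such as N are rendered as X.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from position 0; '*' marks a stop codon, 'X' an unknown codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to their translations."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    # First 18 bases of query3 from the exercise above.
    for frame, protein in six_frame_translations("ATGTCTGCTCCACAAGCC").items():
        print(frame, protein)
```

Running this on a full query and looking at how often `*` appears in each frame gives a rough feel for whether a long open reading frame exists. It is only a heuristic: the BLASTX search against a protein database is what actually answers the question.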