Frog’s eye view of the jungle (time frozen)
Push to restart time

Frog’s eye view of the jungle (time moving)

Frog’s eye view of the jungle (through movement filter)
Push to restart time

Filters: Information reducers
Movement filter

Filters: Information reducers
Sequence filter
TCTACTTATA AAGAGTCTGT TTCTGTCTGC TGGATTTCGG GAACCTTAGT CTCCGTAAAC TGAATAAACT AAGAGTTTAA AAACCTGTAT TTATATATTT CCCCAGCTGT GACAGCACTG GCTGAAATTC CCCTGCACCA ATGAATGACT TATGAGGCAA CTCGGGAGCG CCTTTAGATG AGGCCGGAGG CCCCGGCCTA TTCCCTGGGC
TTCAATCCAC TGAATGAACA TCTGACCTCT AACTCTAGCC GACTTCTGCT CTCTAACATG TTGTTAAAGG AGTTAAAAAC GGTTACATGA TAAGAAATTA CATTAAAAAG ACCCTCAAGA CGCTGAGAGC GGTCTTTCCT GAACGAACGA TCACAGCATC CACGGCTCTA CAAGAAGGAG GTCAAGAACT AGGCTGCCTG TCGGCGGGAC
AGGGCTACAC CATACATGGT GGCAGCTTTC TGCCCCACTC ATACCAAAGT ATGTCAGCAA TACAAATGAA GAATTGCAGT ACTGCCTAAA ATTGCAATTA AGGCAAATAC AGGCACCGGC AGAGTGGTAC GTGGGCACTG TTGAATGAAA AGGTGACCTT AAGAGGCCCA GAAACAGCTC CTCCACCGGC TGCTATAAAT AGATAACATG
CTAGTTCTTG TTATCTGTTT CACTAGTTTC TTAGATAAAC CTCCACGCCC ATATTAAAAA AATTAGCAAA CATTCTAGGG AAACAAGCTA ATTTCCTGGG AGCCAAGGAC TGACAGACAG ATTGAACCCT AGTGCAGACA AGAAATGAGA AGTATCTATT TATCCAGGCA GAAATCCCTG GGCAGCGGCC ACGCGGCCCA AATGTGCCCT
CTCCGTAAAC CTCTAAC...
How organism is made
How organism works

From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
Genetic code
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Rules of folding
Active site
Metabolism, Architecture, Cell interaction
Gives us:
• Custom antibiotics
• Custom antibodies
• Custom enzymes
• New materials

From Sequence to Organism
How does Nature do it?
?
ATGACTTATGATCAACGCACAGGGCTA
Genetic code
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
TCTACTTATA AAGAGTCTGT TTCTGTCTGC TGGATTTCGG GAACCTTAGT CTCCGTAAAC TGAATAAACT AAGAGTTTAA AAACCTGTAT TTATATATTT CCCCAGCTGT GACAGCACTG GCTGAAATTC CCCTGCACCA ATGAATGACT
TTCAATCCAC TGAATGAACA TCTGACCTCT AACTCTAGCC GACTTCTGCT CTCTAACATG TTGTTAAAGG AGTTAAAAAC GGTTACATGA TAAGAAATTA CATTAAAAAG ACCCTCAAGA CGCTGAGAGC GGTCTTTCCT GAACGAACGA
AGGGCTACAC CATACATGGT GGCAGCTTTC TGCCCCACTC ATACCAAAGT ATGTCAGCAA TACAAATGAA GAATTGCAGT ACTGCCTAAA ATTGCAATTA AGGCAAATAC AGGCACCGGC AGAGTGGTAC GTGGGCACTG TTGAATGAAA
CTAGTTCTTG TTATCTGTTT CACTAGTTTC TTAGATAAAC CTCCACGCCC ATATTAAAAA AATTAGCAAA CATTCTAGGG AAACAAGCTA ATTTCCTGGG AGCCAAGGAC TGACAGACAG ATTGAACCCT AGTGCAGACA AGAAATGAGA
3%   97%
TCTACTTATATTCAATCCACAGGGCTA
CACCTAGTTCTTGAAGAGTCTGTTGAA
TGAACACATACATGGTTTATCTGTTTT
TCTGTCTGCTCTGACCTCTGGCAGCTT
ATGACTTATGATCAACGCACAGGGCTA
TAGCCTGCCCCACTCTTAGATAAACGA
ACCTTAGTGACTTCTGCTATACCAAAG
TCTCCACGCCCCTCCGTAAACCTCTAA
CATGATGTCAGCAAATATTAAAAATGA
Rules of transcriptional and post-transcriptional control
• Transcriptional initiation
• Transcriptional termination / polyA tailing
• Splicing
• Translational initiation

From Sequence to Organism
How does Nature do it?
Natural filters/transformations
DNA
• Selective transcription
• Selective processing
• Translation
• Folding
Functional protein

From Sequence to Organism
How can WE do it?
Simulation of Nature
“Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” (Utterance of Wm. Shakespeare)
“We must give our military every tool and weapon it needs to prevail...” (Utterance of George W. Bush)
???
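The genetic-code step shown on the "How does Nature do it?" slides (ATGACTTATGATCAACGCACAGGGCTA read as Met-Thr-Tyr-...) can be reproduced in a few lines. This is a minimal sketch using the standard genetic code; the names `CODON_TABLE`, `THREE_LETTER`, and `translate` are illustrative assumptions and are not part of the course material.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}


def translate(dna):
    """Translate DNA (or RNA) in frame 1 into hyphen-joined three-letter codes."""
    dna = dna.upper().replace("U", "T")
    residues = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")   # "X" for codons with unknown bases
        residues.append(THREE_LETTER.get(aa, "Xaa"))
    return "-".join(residues)


print(translate("ATGACTTATGATCAACGCACAGGGCTA"))
```

The translation is the easy, fully rule-based filter; the later slides concern the steps (gene finding, folding) for which no such simple table exists.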
From Sequence to Organism
How can WE do it?
Surrogate Processes
“Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...” (Utterance of Wm. Shakespeare)
“We must give our military every tool and weapon it needs to prevail...” (Utterance of George W. Bush)
Words/sentence; Choice of words; Sentence structure; …

From Sequence to Organism
How can WE do it?
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
Surrogate filters
• Gene finders (My sequence, Characteristics of coding sequences/introns → Predicted coding regions)
• Similarity finders (My sequence, Sequence/motif databases)
• Feature finders (My sequence, Characteristics of features → Predicted features)
• Pattern finders (My sequences, Statistical engine)

Surrogate Filters
How do they work?
• Gene finders
• Similarity finders
• Feature finders
• Pattern finders
Case studies
• Real problems
• Mixed strategies
• You do it

Surrogate Filters
Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Look for start codons (ATG) (GTG, TTG)
Look for stop codons (TAA, TAG, TGA)

CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA

Frame 1: CTC CAC GCC CCT CCG TAC ACC TCT AAC ATG ATC TCA GCA AAT ATT AAA AAT GAA TAA ACT TTG TGA CAT GTA CAA ATG GAA ATA TGC AA
Frame 2: C TCC ACG CCC CTC CGT ACA CCT CTA ACA TGA TCT CAG CAA ATA TTA AAA ATG AAT AAA CTT TGT GAC ATG TAC AAA TGG AAA TAT GCA A
Frame 3: CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA

Both strands:
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG

Surrogate Filters
Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Pro:
• Quick, simple
Con:
• Useless for eukaryotic genomic sequences (introns)
• Inaccurate (start codon problem)
• Inaccurate (doubtful short open reading frames)

Surrogate Filters
Gene finders
Class 2: Codon bias recognition (TestCode)
Genetic Code
UUU Phe    UCU Ser    UAU Tyr      UGU Cys
UUC Phe    UCC Ser    UAC Tyr      UGC Cys
UUA Leu    UCA Ser    UAA ochre    UGA opal
UUG Leu    UCG Ser    UAG amber    UGG Trp
CUU Leu    CCU Pro    CAU His      CGU Arg
CUC Leu    CCC Pro    CAC His      CGC Arg
CUA Leu    CCA Pro    CAA Gln      CGA Arg
CUG Leu    CCG Pro    CAG Gln      CGG Arg
AUU Ile    ACU Thr    AAU Asn      AGU Ser
AUC Ile    ACC Thr    AAC Asn      AGC Ser
AUA Ile    ACA Thr    AAA Lys      AGA Arg
AUG Met    ACG Thr    AAG Lys      AGG Arg
GUU Val    GCU Ala    GAU Asp      GGU Gly
GUC Val    GCC Ala    GAC Asp      GGC Gly
GUA Val    GCA Ala    GAA Glu      GGA Gly
GUG Val    GCG Ala    GAG Glu      GGG Gly
The code is degenerate
Are codons equally used?
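The Class 1 start/stop codon search described above can be sketched as a scan of all six reading frames. This is a minimal illustration under stated assumptions: the function names (`reverse_complement`, `orfs_in_frame`, `find_orfs`) are hypothetical, GTG and TTG are treated as starts exactly like ATG, and the first start codon in a frame is taken as the beginning of the open reading frame. It is not the algorithm of Map, Frames, or OrfFinder.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # ATG plus the alternative starts on the slide
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield (start, end) spans, 0-based and end-exclusive, in one frame.

    Each span runs from the first start codon seen to the next in-frame
    stop codon (stop included). Open frames with no stop are not reported.
    """
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon in START_CODONS:
            start = i
        elif start is not None and codon in STOP_CODONS:
            yield (start, i + 3)
            start = None


def find_orfs(dna, min_codons=0):
    """Scan three frames on each strand.

    Returns tuples (strand, frame, start, end, sequence); coordinates on the
    "-" strand refer to the reverse-complemented sequence. min_codons counts
    the stop codon as one codon.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for s, e in orfs_in_frame(seq, frame):
                if (e - s) // 3 >= min_codons:
                    results.append((strand, frame + 1, s, e, seq[s:e]))
    return results


# Usage on the sequence shown on the slide (forward strand as printed there).
SLIDE_SEQ = ("CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACT"
             "TTGTGACATGTACAAATGGAAATATGCAA")
for orf in find_orfs(SLIDE_SEQ):
    print(orf)
```

Raising `min_codons` is the simplest guard against the doubtful short open reading frames listed among the cons, at the cost of missing genuinely small genes. The sketch does nothing about introns or about choosing among several candidate start codons.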
Surrogate Filters
Gene finders
Class 2: Codon bias recognition (TestCode)
Most frequently used codons (human)
[Genetic code table repeated with the most frequently used human codons highlighted; the highlighting is not recoverable here]
Codon usage is biased
Codon bias universal?

Surrogate Filters
Gene finders
Class 2: Codon bias recognition (TestCode)
Pro:
• Quick, simple, available through GCG
• Better than Class 1 in excluding false open reading frames
Con:
• Useless for eukaryotic genomic sequences (introns)
• Gives only general areas of open reading frames

Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Principle
Step 1: Create model through extensive training set
* Training set = proven or suspected genes
* Organism-specific
Step 2: Assess candidate genes through filter of model

Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
Training Set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC
AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA
CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT
GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC
TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT
ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG
TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA
TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA
TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA
TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC
AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT
GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA
CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT
GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG
TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
For each triplet (AAA, AAC, AAG, AAT, ACA, ..., TTG, TTT), tally which base follows it in the training set:
AAAA: 33%   AAAC: 25%   AAAG: 12%   AAAT: 30%
AACA: 30%   AACC: 20%   AACG: 15%   AACT: 35%

Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
Candidate gene: AAAGCAA…
Probability of the next base (rows) given the preceding three bases (columns):
      AAA    AAC    AAG    AAT    ACA    ...   TTG    TTT
A     0.33   0.30   0.35   0.30   0.25   ...   0.25   0.30
C     0.25   0.20   0.15   0.15   0.20   ...   0.30   0.25
G     0.12   0.15   0.20   0.20   0.15   ...   0.15   0.10
T     0.30   0.35   0.30   0.25   0.35   ...   0.30   0.35
0.12 x 0.15 . . .
So far, not a good candidate!

Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Pro:
• Among the most accurate methods known
Con:
• Needs big training set
• May miss genes of foreign origin
• Will miss very small genes

Surrogate Filters
Scenario I – Case of the Hidden Heterocyst

Case of the Hidden Heterocyst
[Figure labels: heterocysts, N2, O2, NH3]
Matveyev and Elhai (unpublished)

Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation
2. Sequence out from transposon
3. Find gene boundaries
4. Identify gene
Nostoc genome (sequence out from the transposon):
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATA
ATCAATGACTATCAGACAGAGAATCATCGTGCTGTCA
GTAAAACCTCTGATTTCGATCTTTACCATAATTGTTA
TGTTGTAATGACTAACCAGACTATCTTTTACAGAGCT
TCTGGTTAACACTTGTCTAATTAGACATTGATAATGT
TTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAAT
TACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACG
AGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTT
AACTTCAGAAATTCACGGCGGAAATCCATAGTTATTA
TTACTTATGACTAAAACAAAATTACTATGGCGGCTTG
TTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTA
AAGTCCCACTAACTTTTTTCTCATCTATTGCTATATT
TCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCA
AATAACAAACTCATTTTTAGTAGATATTTCATGCAAA
CTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTAC
AGCCACTCCACAAACCTTAGAATGGCTACTCAATATT
GCAATTGATCATGAATATCCCACTGGTAGAGCAGTTT
TAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGT
TGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA

Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Do it
1. Go to http://www.vcu.edu/~elhaij/BioInf
2. Open second browser (Ctrl-N in Netscape)
   Go to same site (copy and paste URL)
3. In 1st browser, go to Program List
   Click on Gene Finders
   Open GeneMark
4. In 2nd browser, open Nostoc sequence

Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR
VTQTVKLVRLEKFLSLQKSVEEALENVK*
… or was it?
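The scoring idea behind the Class 3 slides (tally which base follows each triplet in a training set, then multiply those conditional probabilities along a candidate) can be sketched as follows. This is a minimal illustration, not GeneMark's implementation: the function names `train` and `log_score`, the pseudocount, and the fallback probability of 0.25 for unseen triplets are all assumptions. `SLIDE_MODEL` holds only the two columns of the slide's table needed to reproduce the running product 0.12 x 0.15.

```python
import math
from collections import Counter, defaultdict

ORDER = 3  # 3rd-order model: each base is conditioned on the preceding three


def train(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(next base | preceding `order` bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, ctr in counts.items():
        total = sum(ctr.values()) + 4 * pseudocount
        model[context] = {b: (ctr[b] + pseudocount) / total for b in "ACGT"}
    return model


def log_score(candidate, model, order=ORDER, default=0.25):
    """Sum of log probabilities of each base given its preceding context.

    Contexts or bases missing from the model fall back to `default`
    (an assumption of this sketch). Logs are used so that long candidates
    do not underflow to zero.
    """
    candidate = candidate.upper()
    total = 0.0
    for i in range(order, len(candidate)):
        probs = model.get(candidate[i - order:i], {})
        total += math.log(probs.get(candidate[i], default))
    return total


# The two columns of the slide's table used by the first steps of AAAGCAA...
SLIDE_MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}

# P(G | AAA) * P(C | AAG) = 0.12 * 0.15
print(math.exp(log_score("AAAGC", SLIDE_MODEL)))
```

A score is only meaningful relative to something. One common choice, stated here as a general practice rather than as what any particular program does, is to score the same candidate with a second model trained on noncoding sequence and compare the two log scores.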
Check predicted protein against databases

Surrogate Filters
Similarity finders
Blast
• BlastP: Protein sequence to search protein database
• BlastN: Nucleotide sequence to search nucleotide database
• BlastX: Nucleotide sequence (translated) to search protein database
• TBlastN: Protein sequence to search (translated) nucleotide database
• Blast2Seq: Compare two sequences you specify
FastA
• (Various flavors)
Pfam (Protein motif families)
• Finds conserved motifs similar to protein sequence
Do it

Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR
VTQTVKLVRLEKFLSLQKSVEEALENVK*
VLGSK
Why?
• GeneMark correct: Conservation of noncoding regions
• GeneMark wrong: Fooled by weird aa sequence or start codon

Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Moral
Automated gene finders are wonderful, but common sense is better
Don’t trust automated annotation

Surrogate Filters
Feature finders
Hidden Markov model-based methods
• Good for contiguous features (e.g. signal sequences)
• Not good with features with gaps (e.g. promoters)
Ad hoc methods
• Feature-specific rules (e.g. tandem repeats, terminators)
Position-dependent frequency tables
= Position-specific scoring matrix (PSSM)
= Weight table

Surrogate Filters
Feature finders
Position-dependent frequency tables
Some of 106 aligned human promoter sequences (near -26)
CCCTATATAAGGC...   histone H1t
CGCTATAAAAACT...   HMG-17
GGGTATATAAGCG...   β-tubulin β2
GGCTATATAAAAC...   α-actin skel-m.
TTCTATAAAGCGG...   α-cardiac actin
CCCTATAAAACCC...   β-actin
GAGTATAAAGCAC...   keratin I 50K
GGTTATAAAAACA...   vimentin
CAGTATAAAAGGG...   α1(I) collagen
CCGTATAAATAGG...   α2(I) collagen
TCCCATATAAGCC...   fibronectin
Consensus: TATAAA

Frequency of each base at each position (%):
A    21   29    0  100    0  100   81   91   57   32   15   26
T    16   22   87    0  100    0   19    0   21    6   10   11
C    28   24   13    0    0    0    0    0    0   15   33   28
G    35   25    0    0    0    0    0    9   22   47   42   34

Surrogate Filters
Feature finders
Position-Specific Scoring Matrix in action
Experimentally proven start sites
aceB   ACTATGGAGCATCTGCACATGAAAACC
atpI   ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB   ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA   ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH   TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ   TTCACACAGGAAACAGCTATGACCATG
rpsJ   AATTGGAGCTCTGGTCTCATGCAGAAC
serC   GCAACGTGGTGAGGGGAAATGGCTCAA
sucA   GATGCTTAAGGGATCACGATGCAGAAC
trpE   CAAAATTAGAGAATAACAATGCAAACA
unknown

Surrogate Filters
Feature finders
Position-Specific Scoring Matrix in action
The same start sites with gaps (.) introduced for alignment (the original column alignment is not recoverable here):
aceB   ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI   ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB   ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA   ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH   TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ   TTCACACAGGAAACAG....CTATGACCATG
rpsJ   AATTGGAGCTCTGGTCTCATGCAGAAC
serC   GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA   GATGCTTAAGGGATCA....CGATGCAGAAC
trpE   CAAAATTAGAGAATA...ACAATGCAAACA
[Frequency table with rows A, C, G, T; values not recoverable]

Surrogate Filters
Pattern finders
Specified patterns (FindPatterns, PatScan)
• e.g. Find instances of restriction sites
New pattern discovery (Meme, Gibbs sampler)
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)    AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t         GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14              CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1        CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin           GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E             TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14              GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17              TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19    ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1       GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin β2        GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m.     CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin     TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin             CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA

Surrogate Filters
Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)    AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t         GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14              CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                 GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1        CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps 1 - 5
Matches:
GACAGGGCAGAA
GCCCGGGTGTTT
GCCGGGGACGCG
GCCCCCGGGCCT
GCCGCAGAGCTG
Frequency table:
A    0.208  0.292  0.000  0.999  0.000  0.999  0.811  0.905  0.575  0.321  0.151  0.264
T    0.160  0.217  0.867  0.000  0.999  0.000  0.189  0.000  0.208  0.057  0.104  0.113
C    0.283  0.236  0.132  0.000  0.000  0.000  0.000  0.000  0.000  0.151  0.330  0.283
G    0.349  0.255  0.000  0.000  0.000  0.000  0.000  0.095  0.217  0.472  0.415  0.340

Surrogate Filters
Scenario II – Case of the Masked Motif
• You’ve found a gene related to Purple Tongue Syndrome
• BlastP: Encoded protein related to cAMP-binding proteins
• Are the similarities trivial? Related to cAMP binding?
• Does your protein contain a cAMP-binding site?
• What IS a cAMP-binding site?
Task
1. Determine what is a cAMP-binding site
2. Determine if your protein has one

Surrogate Filters
Scenario II – Case of the Masked Motif
Strategy
1. Collect sequences of known cAMP-binding proteins
2. Run Meme, a pattern-finding program
   Ask it to find any significant motifs
3. Rerun Meme. Demand that every protein has identified motifs
4. Run Pfam over known sequence to check
Do it

Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Progressive External Ophthalmoplegia (PEO)
• Slow paralysis of voluntary eye muscles
• Many other symptoms (e.g., frequent deafness)
• Loss of mitochondrial DNA
Inheritance
• Mendelian
• Autosomal dominant
• Linked to chromosome 4q34
Your task
• Examine sequence of 4q34 region
• Assess likelihood that a gene in the area could cause disease symptoms

Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Examining Sequence of 4q34 Region
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgct
ctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatgccctctgtggccctggaaccttagtgacttctgctat
accaaagtctccacgcccagggtgacacgcagctgcagctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagtttataaaaacaatgaataaactttgttaaaggtacaaatgaaaat
tagcaaacatgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagtcattctaggggaaggaacagttgtatttgaaaacctgtatggttacatgaactgcctaaaaaacaagctaagga aaattaaagctcagatttatatattttaagaaattaattgcaattaatttcctgggattaaatagcatttcctcaaccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttctccgga aggctgacagcactgaccctcaagaaggcaccggctgacagacagaacattctgccctaatatgtgctgaaattccgctgagagcagagtggtacattgaaccctttaggggcttacaaaagaagtgtcct gtgttttagagtcacagagttttgcagaaacaagtatgaattcacctagtggccccctgcaccaggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcagaatgaatgactgaacgaa cgattgaatgaaaagaaatgagaggcagcaggttgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatttgagaggactgccatttattctcgggagcgcacggctctaaagaggcc catatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggaggaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagggtcaagaactctccaccggcggcagcggc ccggtgtctgccccggcttcgccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcggtgacacggtgttccctgggctcggcgggacagataacatgaatgtgccctttaaacgtcc caagttgcagggacagcccccggcccagcctcgctcccggaagcgccttcgcccccgatgccctctgcagctgggaggagggggcgccccgcacctgcccagccaatgcgcggcgcgagcgccggccgcga cccgcctcctctcgcgagagcccggcggggatataagggggagctgcgggccaggcggcggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgggcgtgggcgagagcacgaacgg gctgcctgcgggctgagagcgtcgagctgtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgggggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagagggtca aactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgggcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcggggcgccgcggaaaatctgcgccaggccacaggc ccgggcgcccgcccgcccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtcgcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatggaaacccacccggagc cggtttacgtgtgccagatcctgcgcccgtgacagcacgggcgtgcactcaggcccggaggcacctagtgattgccagtatttttggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaat aatcatcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtaggctgacgccttcatctttatgtaacctctgtgagagagttattcttctccattttacagatgaagctgaggttttgaa atattaagaaacaattttcggaataaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgctgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtcagggcccct 
caatgaggaagagcccaatttgggagtcagaattactaacaacaaaacccccacaaattgctcacaacggcagcaaacccttaataattgattacttggattatctgcttgaaaactttggaggcctaatg tttagtggatttattctccttcctctattagagcatctagtagagatcctcatctccagggtgatcagagtgacactgagaaattgtcattttttggccatcatgtctattaaatccaaagccctttgaag cagggagtgttactcatttctgtcccccagtaagcccctcatacagttctcaaacctagggaaagtgaaataaataaatggctatagctttatataattcaatcaccttttcagtttatttggggcaatac ctttccctcaaataccctaataattgaagcaacattggattattttggcttgttatccagtaactaacatggataacagtatccatttacacgtcctcgtatccatttgatttcctcatcctttttttctt caaaaaaaaaatctaggaagtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctgcctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagcaaacagatcagtgc tgagaagcagtacaaagggatcattgattgtgtggtgagaatccctaaggagcagggcttcctctccttctggaggggtaacctggccaacgtgatccgttacttccccacccaagctctcaacttcgcct tcaaggacaagtacaagcagctcttcttagggggtgtggatcggcataagcagttctggcgctactttgctggtaacctggcgtccggtggggccgctggggccacctccctttgctttgtctacccgctg gactttgctaggaccaggttggctgctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgactgtatcatcaagatcttcaagtctgatggcctgagggggctctaccagggtttcaa cgtctctgtccaaggcatcattatctatagagctgcctacttcggagtctatgatactgccaagggtgagagaggggcatcggggagaaggagggtggtgtggaaagaggatcctatgggatctataactc acaaaggacctgatatatattgatcttgttttttctagtctctgggataattgaggcttctgaatgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcctctactgaaataaactctg gcctttagttattcagagaggaggaggggggagcctgtctccctctagacacagccatagcagttactgagtttaacttgaagccacttccaatgccctgtatacaagctgagcactgcccctccggggtc cggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcacctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgcagtggcctctctccctccacctgctttctg ctgagaacaggcacttcatagccgttcggcttctgggctctgtccacagggatgctgcctgaccccaagaacgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtcgcagggctggt gtcctacccctttgacactgttcgtcgtagaatgatgatgcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttgtttggttttgcccgaggagaacattttacagggctcctttca gtcttccttactggaaattaattttcaaaattatttgataaggacttagggaagaaagatggtattaattccccctaacgttctcaactatcctattagggaaaagtattttccattttattagagatgat 
aagaacatgaatagtaagacatttagatgtgaatttaactaggtatccagcattatagagaccctaggccctcttcccttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttcttaca aagaactcttgcttccctcctagttacaggtgttagtgggatggggtgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaagttttggcttctataggttgaaccatatgaaattgc cactttaaaagtcaaaaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagccttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctcagttcggtcctccata aaaaaaggtaaccgcgtagcataatactcctgctccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggttaattgccccagttcttcactgaccttgaactaatggagtaggaatg acaggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgttgcatggagctgggactccatgcccagatgaccctgattttataaaactggtaacagtgtgtacagatatgtttcagg ggaaaagtctctttcctccagcgttacggagccctcaccagcatttgtttccacagccgatattatgtacacggggacagttgactgctggaggaagattgcaaaagacgaaggagccaaggccttcttca aaggtgcctggtccaatgtgctgagaggcatgggcggtgcttttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaattaaaacacaagttcacagatttacatgaacttgatctacaag ttcacagatccattgtgtggtttaatagactattcctaggggaagtaaaaagatctgggataaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttcattaaaccacacatgtatttt Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Strategy • Assume that encoded protein is in mitochondria • Protein has function associated with mitochondrial location? – Use Gene finder to identify protein sequence(s) – Use Similarity finder to identify possible function • Protein has structure associated with mitochondrial location? – Use Feature finders to identify pertinent regions – (What ARE pertinent regions?) Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Run 4q34 region through FGene Name: PEO-related_gene? First three lines of sequence: tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg fgene Wed Feb 27 16:55:29 GMT 2002 >PEO-related_gene? 
length of sequence 5768 number of predicted exons - 5 positions of predicted exons: 1607 1717 w= 17.84 ORF: 1607 1717 2985 3231 w= 9.13 ORF: 2985 3230 3421 3471 w= 6.08 ORF: 3423 3470 3980 4120 w= 12.62 ORF: 3982 4119 5035 5192 w= 1.93 ORF: 5037 5192 Length of Coding region708bp Amino acid sequence MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG IIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSG RKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV* 235aa Surrogate Filters Scenario III – Case of the Mortal Mitochondrion Run 4q34 region through FGeneSH Name: PEO-related_gene? First three lines of sequence: tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg Fgenesh Wed Feb 27 16:59:14 GMT 2002 FGENESH 1.0 Prediction of potential genes in Human Time: Wed Feb 27 16:59:14 2002 Seq name: PEO-related_gene? Length of sequence: 5768 GC content: 48 Zone: 2 Positions of predicted genes and exons: G Str Feature Start End Score ORF 1 1 1 1 1 1 + + + + + + 1 2 3 4 TSS CDSf CDSi CDSi CDSl PolA 1216 1607 2985 3980 5035 5471 - 1717 3471 4120 5192 -2.70 18.01 52.41 20.99 2.32 0.92 1607 1607 2985 2985 3421 3982 3980 5037 5035 genomic DNA Len ----- FGENE output 1717 w= 111 17.84 1717 3231 w= 4869.13 3470 3471 w= 1386.08 4119 4120 w= 156 12.62 5192 5192 w= 1.93 Predicted protein(s): >FGENESH 1 4 exon (s) 1607 5192 298 aa, chain + MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS How to decide where exons are? 
[Figure: gene structure diagram. Labels: P, Exon, Intron, Exon, Intron, Exon; DNA, hnRNA, mRNA; AAAAAAAA]
Strategy
• Compare sequence of 4q34 region to sequence of mRNA
• Sequence of mRNA may be in cDNA library
• Expressed Sequence Tag (EST) library
Problems
• Library may not exist
• Expression of gene may be low

Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastN (vs. human ESTs)

Final Score Card for Gene Finders

  Feature                    FGene                  FGeneSH               BlastN of EST library
                             (splice site           (FGene + HMM model)   (compare with known)
                             recognition)
  Transcription Start Site                          1216                  1501
  Exon 1                     …1607-1717             …1607-1717            …1607-1717
  Exon 2                     2985-3231              2985-3471             2985-3471
  Exon X                     3421-3471
  Exon 3                     3980-4120              3980-4120             3980-4120
  Exon 4                     5035-5192…             5035-5192…            5035-5192…
  PolyA site                                        ???                   ???

MORAL: Trust, but verify.

Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Strategy
• Assume that encoded protein is in mitochondria
• Protein has function associated with mitochondrial location?
  – Use Gene finder to identify protein sequence(s)
  – Use Similarity finder to identify possible function
• Protein has structure associated with mitochondrial location?
  – Use Feature finders to identify pertinent structures
  – (What ARE pertinent structures?)

Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastP

Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg

Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768  GC content: 48  Zone: 2
Positions of predicted genes and exons:
  G Str    Feature   Start      End    Score     ORF            Len
  1  +     TSS        1216             -2.70
  1  +   1 CDSf       1607 -   1717    18.01    1607 -   1717   111
  1  +   2 CDSi       2985 -   3471    52.41    2985 -   3470   486
  1  +   3 CDSi       3980 -   4120    20.99    3982 -   4119   138
  1  +   4 CDSl       5035 -   5192     2.32    5037 -   5192   156
  1  +     PolA       5471              0.92
Coordinates: genomic DNA

Predicted protein(s):
>FGENESH 1 4 exon (s) 1607 5192 298 aa, chain +
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM

Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
Summary
• One protein in region
• Contains mitochondrial carrier motifs
• Similar to ATP/ADP transporter
• Mitochondrial signal sequence?
Reasonable candidate for PEO-related protein

Complex gene discovery
Your turn: Repeat and extend characterization of PEO-related gene
1. Take same sequence (FastA format) e-mailed to you
2. Get better estimate of promoter and polyA site (e.g. by TSSW and PolyASH)
   (Is there a TATA box upstream from the predicted promoter?)
3. Find encoded protein sequence by suitable method (e.g. FGeneSH(GC) or comparison with cDNA)
4. Continue characterization of protein
   * Contains signal sequence?
   * Contains transmembrane domains?

Filter limitation
Inevitable… but whose filter?
Filters controlled by outside programmers
Filters controlled by you
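
As one small illustration of a filter under your own control, the sketch below cross-checks the exon coordinates reported in the FGENE and FGENESH outputs above against the protein lengths each program states. It is only a sanity check, not part of either tool: the function names `coding_length` and `protein_length` are made up for this example, and it assumes coordinates are 1-based and inclusive and that the last codon of the coding region is a stop codon.

```python
# Exon coordinates (assumed 1-based, inclusive) copied from the two outputs.
FGENE_EXONS = [(1607, 1717), (2985, 3231), (3421, 3471), (3980, 4120), (5035, 5192)]
FGENESH_EXONS = [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)]


def coding_length(exons):
    """Total number of bases covered by the predicted exons."""
    return sum(end - start + 1 for start, end in exons)


def protein_length(exons):
    """Amino acids encoded, assuming the final codon is a stop codon."""
    total = coding_length(exons)
    if total % 3 != 0:
        raise ValueError(f"coding length {total} is not a multiple of 3")
    return total // 3 - 1


if __name__ == "__main__":
    for name, exons in (("FGENE", FGENE_EXONS), ("FGENESH", FGENESH_EXONS)):
        print(f"{name}: {coding_length(exons)} bp, {protein_length(exons)} aa")
```

Under these assumptions the five FGENE exons sum to 708 bp (235 aa plus stop) and the four FGENESH exons to 897 bp (298 aa plus stop), matching the lengths printed by the two programs. The two predictions are therefore each internally consistent, and the disagreement between them lies in where the exon boundaries are placed.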