Welcome to Introduction to Bioinformatics
Wednesday, 1 December

• Intro to Scenario 8: Identification of genes of foreign origin

Scenario 1: Comparison of genomes of pathogenic and nonpathogenic E. coli
(E. coli K12, E. coli O157:H7)

E. coli: What makes it kill?
[Diagram: two raw genome sequences (TCTACTTATA AAGAGTCTGT TTCTGTCTGC ...), each passed through a Gene finder; the gene sets are compared with Blast + parser, giving Pathogen specific genes (~1000!) and then Pathogenesis specific genes (?)]

[Diagram: DNA of Pathogen and Nonpathogen, marking Pathogenicity Islands (PAI), Virus-related genes, and Disease-related genes]
[Diagram, continued: the same two genome sequences, Gene finder, and Blast + parser steps; Pathogen specific (~1000!), then Pathogenesis specific (?), now with "foreign genes" added]

How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias
Dinucleotide frequencies
Gene from Prochlorococcus marinus MED4:
ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATTCAATTA
GTTAGTAAAGCAGTTGCTTCAAGACCTACGCATCCTATCCTTGCAAAT
TTACTTCTAACAGCTGATCAAGGTACTAATAAAATTAGTTTAACTGGA
TTTGATTTGAATCTAGGAATACAAACTTCATTTGATGCAACTGTAAAC
AAAAGTGGAGCAATTACAATTCCATCTAAACTTTTATCTGAAATAGTT
AATAAACTACCAAGCGAAACTCCTGTCTCTCTTGATGTTGATGAGAGT
TCTGACAATATTTTAATTAAAAGTGATAGGGGTTCTTTTAATATTAAA
GGTATTCCATCAGACGATTACCCAAGCTTACCGTTTGTAGAAAGTGGT
ACATCTTTGAATATTGATCCAAGTTCTTTTTTAAAAGCTTTAAAATTA
ACTATATTCGCTAGTAGTAGTGATGATTCAAAGCAATTACTCACAGGA
GTAAATTTTACATTTAATTTAAAATATTTGGAGTCAGCTGCAACAGAT
GGGCATAGATTGGCTGTTGTTTTGGTTGATAACAAAGAAAATTTTGAT
GAAAAAGAAGATTTTGCTTCAAATGAAGAAAACTTATCAGTTACTATA
CCAACAAGATCTTTAAGAGAAATTGAAAAGCTTGTTAGCCTTAGAAGT
TCTGAAAATTCAATTAAACTTTTCTATGACAAAGGTCAAGTAGTATTT
ATTTCCTCTAATCAAATAATTACTACTAGAACCCTTGAAGGTTCTTAT
CCAAATTATTCTCAATTAATACCTGATAATTTTACTAAAATTTTTACA
TTTAATACAAAAAAAATAATCGAATCACTTGAAAGAATAGCAGTTTTA
GCAGACCAACAAAGTAGTGTCGTTAAGATTAAACTTAATGAAAAGGAT
TTAGCATTAGTCAGTGCTGATGCTCAAGACATAGGGAATGCCAGCGAA
TTAGTTCCTGTATCTTTTTATTTTGATCAATTTGATATAGCTTTTAAT
GTAAGGTATTTATTAGAAGGTTTAAAAGTTATATCAAGTGAAAATGTA
ATTTTTAAATGTAATCTTCCAACTACTCCAGCTGTTTTAGTTCCAGAA

Dinucleotide counts:
       A     C     G     T
A    167    47    80   116
C     61    26    10    65
G     65    30    23    59
T    117    59    64   168

ρ*XY = f*XY / (f*X f*Y)

Study Question 3: Calculate ρ*AA in the following 50-nt sequence:
TGATGACAGTCGATTTTTCGGTAGG
ATAACTGCCATGCCTCTCAAAGTAC

δ*(f, g) = (1/16) Σ |ρ*XY(f) - ρ*XY(g)|
Calculate δ*(human, mouse)?

Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias

[Figure: Cholera toxin locus, 100 kbases]
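As a rough illustration of the ρ* and δ* formulas above, here is a minimal Python sketch. It assumes the asterisk means that frequencies are pooled over the sequence and its reverse complement, which is the usual convention for this measure but is not spelled out on the slides. The names `rho`, `delta`, and the `symmetrize` flag are made up for this example.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def rho(seq, symmetrize=True):
    """Return {XY: f_XY / (f_X * f_Y)} for all 16 dinucleotides.

    With symmetrize=True (an assumption about what the asterisk means),
    counts are pooled over the sequence and its reverse complement.
    """
    strands = [seq.upper()]
    if symmetrize:
        strands.append(reverse_complement(strands[0]))
    mono, di = Counter(), Counter()
    for s in strands:
        mono.update(s)
        # Count overlapping dinucleotides within each strand separately.
        di.update(s[i:i + 2] for i in range(len(s) - 1))
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x, y in product(BASES, repeat=2))
    result = {}
    for x, y in product(BASES, repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        # A dinucleotide whose bases never occur gets 0.0 rather than an error.
        result[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return result


def delta(seq_f, seq_g, symmetrize=True):
    """Average absolute difference of rho over the 16 dinucleotides."""
    rf, rg = rho(seq_f, symmetrize), rho(seq_g, symmetrize)
    return sum(abs(rf[k] - rg[k]) for k in rf) / 16


# The 50-nt sequence from Study Question 3.
STUDY_SEQ = "TGATGACAGTCGATTTTTCGGTAGG" + "ATAACTGCCATGCCTCTCAAAGTAC"

if __name__ == "__main__":
    print("rho*AA =", rho(STUDY_SEQ)["AA"])
```

Calling `delta(gene, genome_sample)` on a candidate gene and a sample of the host genome gives one number per gene; a large value would hint at unusual composition, though on sequences as short as a single gene the estimate is noisy.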
Current methods
• % GC

%GC of genes of Prochlorococcus:
0.29274613 0.24678363 0.2797619 0.32916668 0.29405162 0.36720142 0.34567901 0.28865057 0.29333332 0.35947713
0.39583334 0.3116883 0.6219512 0.26923078 0.28615862 0.28374657 0.2905983 0.30982906 0.29574862 0.25382262
...
0.27824858 0.3089947 0.3152174 0.37955555 0.26300985 0.3510848 0.34285715 0.33333334 0.3200222 0.32882884
0.29283488 0.2754821 0.37037036 0.3197176 0.30438185 0.3477868 0.3215859 0.34751773 0.30287206 0.34519956
0.32350427 0.32791328 0.46096095 0.26495728 0.3646139 0.31604227 0.38206628 0.2717087 0.3412162 0.34351662
0.47674417 0.30488145 0.29738563 0.27430555 0.272578 0.31501833 0.3195021 0.35598704 0.3068182 0.3283208
0.321013 0.2984127 0.31343284 0.28431374 0.29989868 0.2984234 0.33838382 0.33004925 0.32882375 0.26504064
0.296 0.36578172 0.42553192 0.32748538 0.32188296 0.31860465 0.29113925 0.30678466 0.3878205 0.26345214
0.28214577 0.32879817 0.34343433 0.32193732 0.33516484 0.23024055 0.32380953 0.31546623 0.29943502 0.32506204
0.30749354 0.28992248 0.36120042 0.32178217 0.35273737 0.2952816 0.3432836 0.32900432 0.3469388 0.2958245
How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias

Good for large blocks of nucleotides
Not as good for individual genes
Need a more information-dense indicator

Markov Models: Building the model
Training Set
Contexts: AAA AAC AAG AAT ACA ... TTG TTT
AAA:  AAAA: 10%   AAAC: 15%   AAAG: 40%   AAAT: 35%
AAC:  AACA: 25%   AACC: 45%   AACG: 25%   AACT: 5%

Markov Models: Using the model
3rd order Markov model
Candidate gene: AAAACAA…

      AAA    AAC    AAG    AAT    ACA    ...    TTG
A    0.10   0.25   0.25   0.25   0.15    ...   0.20
C    0.15   0.45   0.20   0.20   0.20    ...   0.50
G    0.40   0.25   0.30   0.30   0.25    ...   0.05
T    0.35   0.05   0.25   0.25   0.40    ...   0.25

TTT: 0.10 0.55 0.25 0.10 0.10 (five values appear on the slide; their assignment to A, C, G, T is unclear)

Markov Models
• Analyze sequence: compare test sequence to model
• Produce new sequence per model

Markov Models
• Analyze sequence: take a test run through Hamlet.pl
• Produce new sequence per model
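To make the "building the model" and "using the model" steps of the third-order Markov model slides concrete, here is a minimal Python sketch. `SLIDE_MODEL` copies the five context columns that are legible on the slide. The function names `train` and `log_prob`, the pseudocount, and the uniform fallback for unseen contexts are all assumptions for illustration; this is not a description of what Hamlet.pl does.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train(sequences, order=3, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            # Skip windows that contain ambiguous symbols such as N.
            if set(context + nxt) <= set(BASES):
                counts[context][nxt] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values()) + 4 * pseudocount
        model[context] = {b: (c[b] + pseudocount) / total for b in BASES}
    return model


def log_prob(seq, model, order=3, unseen=0.25):
    """Sum of log transition probabilities along `seq`.

    Contexts or bases missing from the model fall back to `unseen`
    (a uniform guess; an assumption of this sketch).
    """
    seq = seq.upper()
    total = 0.0
    for i in range(len(seq) - order):
        context, nxt = seq[i:i + order], seq[i + order]
        p = model.get(context, {}).get(nxt, unseen)
        if p <= 0:
            return float("-inf")  # impossible under the model
        total += math.log(p)
    return total


# Transition probabilities read off the slide (contexts AAA .. ACA only).
SLIDE_MODEL = {
    "AAA": {"A": 0.10, "C": 0.15, "G": 0.40, "T": 0.35},
    "AAC": {"A": 0.25, "C": 0.45, "G": 0.25, "T": 0.05},
    "AAG": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    "AAT": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    "ACA": {"A": 0.15, "C": 0.20, "G": 0.25, "T": 0.40},
}

if __name__ == "__main__":
    candidate = "AAAACAA"  # the candidate gene prefix shown on the slide
    print(math.exp(log_prob(candidate, SLIDE_MODEL)))
```

For the slide's candidate AAAACAA the transitions are AAA to A (0.10), AAA to C (0.15), AAC to A (0.25), and ACA to A (0.15), so the product should come out to about 0.00056. To flag a possibly foreign gene, one could compare its per-nucleotide log probability under a model trained on the host genome against that of typical host genes.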
Scenario 7: Gene Identification
How do you tell if an ORF is real?

Genetic Code
UUU Phe    UCU Ser    UAU Tyr      UGU Cys
UUC Phe    UCC Ser    UAC Tyr      UGC Cys
UUA Leu    UCA Ser    UAA ochre    UGA opal
UUG Leu    UCG Ser    UAG amber    UGG Trp

CUU Leu    CCU Pro    CAU His      CGU Arg
CUC Leu    CCC Pro    CAC His      CGC Arg
CUA Leu    CCA Pro    CAA Gln      CGA Arg
CUG Leu    CCG Pro    CAG Gln      CGG Arg

AUU Ile    ACU Thr    AAU Asn      AGU Ser
AUC Ile    ACC Thr    AAC Asn      AGC Ser
AUA Ile    ACA Thr    AAA Lys      AGA Arg
AUG Met    ACG Thr    AAG Lys      AGG Arg

GUU Val    GCU Ala    GAU Asp      GGU Gly
GUC Val    GCC Ala    GAC Asp      GGC Gly
GUA Val    GCA Ala    GAA Glu      GGA Gly
GUG Val    GCG Ala    GAG Glu      GGG Gly

The code is degenerate
Are codons equally used?

Most frequently used codons: Genetic Code (human)
[Genetic code table repeated on the slide]
The third position is biased

[Diagram: "If AT", "If CG"; labels G G C A T; codons ATG CGG TGG GCC CAA CCA CAT CGT GGG CAG TCC CTT; genetic code table repeated]
PSSM for third position?
Third order Markov Chain
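The questions "Are codons equally used?" and "PSSM for third position?" can be explored with a small Python sketch that tallies codon usage and base frequencies at each codon position. The per-position frequency table is used here as a simple stand-in for a PSSM, which is an assumption of this example, and the function names are hypothetical. The test string is the run of codons shown on the last slide.

```python
from collections import Counter

BASES = "ACGT"


def codons(orf):
    """Split an in-frame DNA string into codons.

    Whitespace is ignored and a trailing partial codon is dropped.
    """
    seq = "".join(orf.split()).upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def codon_usage(orf):
    """Count how often each codon occurs in the ORF."""
    return Counter(codons(orf))


def position_frequencies(orf):
    """Return three dicts: base frequencies at codon positions 1, 2 and 3."""
    cs = codons(orf)
    n = len(cs) or 1  # avoid dividing by zero on an empty input
    tables = []
    for pos in range(3):
        counts = Counter(c[pos] for c in cs)
        tables.append({b: counts[b] / n for b in BASES})
    return tables


# Codons as they appear on the slide.
SLIDE_CODONS = "ATG CGG TGG GCC CAA CCA CAT CGT GGG CAG TCC CTT"

if __name__ == "__main__":
    for pos, table in enumerate(position_frequencies(SLIDE_CODONS), start=1):
        print("position", pos, table)
```

On a real ORF from an organism with strong third-position bias, the third table should differ clearly from the first two; on a spurious ORF one would expect the three tables to look more alike. Twelve codons are far too few to show this reliably, so the slide string only demonstrates the mechanics.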