Welcome to BNFO 601
Integrated Bioinformatics

The villains gallery:
Paul Fawcett [email protected]
Jeff Elhai [email protected]

Course Organization
www.vcu.edu/csbc/bnfo601
Scientific problems, bioinformatic solutions!
For each topic:
• Lecture and web supplements
• Discussion and computer time
• Problem sets
Focus on principles, not soon-to-be obsolete software!!

Optional Textbook
“The Quick Python Book”, 2nd edition, by Vernon L. Ceder. Manning Press Inc.
If purchased new, it also comes with an e-book.

What is bioinformatics, anyway?
One reasonable definition: The study and application of computational and statistical methods for the management and analysis of biological information.
This is a very broad definition, so there are many “flavours” of bioinformatics.

Why do we need bioinformatics?
Year    Base pairs      Sequences
1982    680338          606
1983    2274029         2427
1984    3368765         4175
1985    5204420         5700
1986    9615371         9978
1987    15514776        14584
1988    23800000        20579
1989    34762585        28791
1990    49179285        39533
1991    71947426        55627
1992    101008486       78608
1993    157152442       143492
1994    217102462       215273
1995    384939485       555694
1996    651972984       1021211
1997    1160300687      1765847
1998    2008761784      2837897
1999    3841163011      4864570
2000    11101066288     10106023
2001    15849921438     14976310
2002    28507990166     22318883
Available biological data is growing exponentially!

Why do we need bioinformatics?
Humans have a hard time finding patterns in large or complex data sets!
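The claim of exponential growth can be checked against the base-pair figures above. The following is a minimal sketch, not part of the original slides: it fits a straight line to log2(base pairs) against year by ordinary least squares and reports the implied doubling time. The function name `doubling_time_years` is hypothetical, and the choice of a simple log-linear fit is an assumption.

```python
import math

# (year, base pairs) copied from the growth figures quoted above
GENBANK_BASE_PAIRS = [
    (1982, 680338), (1983, 2274029), (1984, 3368765), (1985, 5204420),
    (1986, 9615371), (1987, 15514776), (1988, 23800000), (1989, 34762585),
    (1990, 49179285), (1991, 71947426), (1992, 101008486), (1993, 157152442),
    (1994, 217102462), (1995, 384939485), (1996, 651972984),
    (1997, 1160300687), (1998, 2008761784), (1999, 3841163011),
    (2000, 11101066288), (2001, 15849921438), (2002, 28507990166),
]


def doubling_time_years(points):
    """Least-squares slope of log2(count) vs. year, returned as years per doubling."""
    xs = [year for year, _ in points]
    ys = [math.log2(count) for _, count in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance  # doublings per year
    return 1.0 / slope


if __name__ == "__main__":
    years = doubling_time_years(GENBANK_BASE_PAIRS)
    print(f"Base pairs doubled roughly every {years:.2f} years")
```

If the growth were linear rather than exponential, the log2 values would flatten over time instead of rising at a steady rate, and the fitted doubling time would not be meaningful.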
[Slide background: a long protein sequence, MDVEEFLSRVDAGELVISLGDLSGAILSEVDLSGINLSG...]
Biological data can be confusing!

[Slide background: the same protein sequence]
But is rich in information content!

Where did bioinformatics come from?
Evolved from, but is distinct from, the intellectual traditions of:
• Genetics
• Biochemistry
• Molecular Biology
• Computer Science
• Probability & Statistics
• Genomics

Pre-genomic Molecular Biology
The cell as a factory

Pre-genomic Molecular Biology
The cell as a “Black box”

Pre-genomic Molecular Biology
How do we figure out how cars are made?
• Genetic approach
• Biochemical approach

Pre-genomic Molecular Biology
Biochemist’s Approach
An inherently reductionist approach!

Pre-genomic Molecular Biology
Geneticist’s Approach
Isolation of a Defective Gene

Pre-genomic Molecular Biology
How we viewed the world
• One component at a time
• Highly filtered perception
• Many local viewpoints
• Subject to ascertainment bias

Post-genomic Molecular Biology
A major goal is to achieve a synoptic, integrated understanding of cell function

Post-genomic Molecular Biology
Bioinformaticist’s Approach (short term)
Identify critical parts

Post-genomic Molecular Biology
Bioinformaticist’s Approach (long term)
Assemble the whole

Genomics
[Slide graphic: genomic DNA sequence, TCTACTTATA AAGAGTCTGT TTCTGTCTGC ...]
What is Bioinformatics?

What is genomic data?
[Slide background: a long genomic DNA sequence]

Partial Hierarchy of Genomic Data
• DNA Sequences
• Contigs of assembled sequences
• Predicted introns, exons, promoters, etc.
• Genes
• RNA sequences
• Predicted gene products, proteins
• Chromosomes
• Genome
Sequence Analysis is therefore a fundamental component of bioinformatics!

E. coli: What makes it kill?
Escherichia coli . . . very small lab rats
Courtesy of Kent State University Microbiology

E. coli: What makes it kill?
Escherichia coli . . . haemorrhagic colitis

E. coli: What makes it kill?
E. coli K12: [genomic DNA sequence], passed to a Gene finder
E. coli O157:H7: [genomic DNA sequence], passed to a Gene finder

E. coli: What makes it kill?
Killer protein
Similarity finder
Killer functions:
• Membrane protein, sodium transporter
• Iron responsive transcriptional regulator
• Calcium-dependent protein kinase
• Unknown protein
• Unknown protein
• Unknown protein
• etc . . .
ideas for new antibiotics

What is Bioinformatics?
Genomics
Metabolomics

Towards a Treatment for Sleeping Sickness
Causative agent: Trypanosoma brucei
Prevalence: ~66 million sufferers
Standard treatment: Derivative of arsenic
[Photo: T. brucei swimming in blood. Photo courtesy of Center for Disease Control Image Library]

Towards a Treatment for Sleeping Sickness
Trypanosomes: Dependent on glycolysis
Humans: Dependent on glycolysis OR oxidative metabolism
IDEA: Identify drug that selectively blocks glycolysis

Towards a Treatment for Sleeping Sickness
How to block glycolysis?
• ~dozen enzyme targets
• ~$1 billion per target
• effective on enzyme = effective on trypanosomes
Need a method to predict effectiveness!!
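One way to predict effectiveness before committing to a target is to simulate the pathway with mass-action rate laws of the form d(G6P)/dt = k3[glucose][ATP], the form used in the glycolysis model of this lecture. The sketch below is illustrative only and makes several assumptions that are not in the slides: it covers just the first three steps, it adds the consumption terms needed for mass balance, it uses made-up rate constants and starting concentrations in arbitrary units, and it integrates with a simple forward-Euler step. The function name `simulate` and the 90% inhibition scenarios are hypothetical.

```python
def simulate(k3=1.0, k4=1.0, k6=1.0, dt=0.001, t_end=20.0):
    """Forward-Euler integration of a three-step mass-action chain.

    glucose + ATP -> G6P + ADP   rate v3 = k3[glucose][ATP]  (hexokinase)
    G6P           -> F6P         rate v4 = k4[G6P]
    F6P + ATP     -> FDP + ADP   rate v6 = k6[F6P][ATP]

    All rate constants and initial concentrations are hypothetical.
    """
    c = {"glucose": 10.0, "ATP": 5.0, "ADP": 0.0,
         "G6P": 0.0, "F6P": 0.0, "FDP": 0.0}
    for _ in range(int(round(t_end / dt))):
        v3 = k3 * c["glucose"] * c["ATP"]
        v4 = k4 * c["G6P"]
        v6 = k6 * c["F6P"] * c["ATP"]
        # Each metabolite changes by (production - consumption) * dt
        c["glucose"] -= v3 * dt
        c["ATP"] -= (v3 + v6) * dt
        c["ADP"] += (v3 + v6) * dt
        c["G6P"] += (v3 - v4) * dt
        c["F6P"] += (v4 - v6) * dt
        c["FDP"] += v6 * dt
    return c


if __name__ == "__main__":
    # "Different realities": inhibit one enzyme at a time by 90%
    scenarios = {
        "baseline": {},
        "k3 inhibited": {"k3": 0.1},
        "k4 inhibited": {"k4": 0.1},
        "k6 inhibited": {"k6": 0.1},
    }
    for name, overrides in scenarios.items():
        result = simulate(t_end=5.0, **overrides)
        print(f"{name:>14}: FDP = {result['FDP']:.3f}, ATP = {result['ATP']:.3f}")
```

Comparing the scenario outputs is the point of the exercise: the enzyme whose inhibition changes the end product most is the more promising target under the model's assumptions, which is a much cheaper experiment than testing each target at the bench.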
Towards a Treatment for Sleeping Sickness
Glucose + ATP → Glucose-6-phosphate + ADP (Hexokinase)
d(G6P)/dt = k3[glucose][ATP]
• d(G6P)/dt: Rate of increase of metabolite
• [glucose], [ATP]: Concentrations of metabolites
• k3: Rate constant (property of enzyme)

Towards a Treatment for Sleeping Sickness
Model of glycolysis
d(G6P)/dt = k3[glucose][ATP]
d(F6P)/dt = k4[G6P]
d(FDP)/dt = k6[F6P][ATP]
. . .
d(pyruvate)/dt = k20[PEP][ADP]
[Plot: model output against time; horizontal axis "time" from 0 to 20, vertical axis from 0 to 12]

Towards a Treatment for Sleeping Sickness
Run model with different realities
Which target enzyme best for ATP?

What is bioinformatics?
Genomics
Metabolomics
Structure
Proteomics
Transcriptomics
Systems

What is bioinformatics, revisited
How to extract biological meaning from overwhelming information

A Walk in the Forest
* Photo courtesy of www.webshots.com

Observation
* Photos courtesy of www.webshots.com and Peter Smallwood

Experiment
* Photos courtesy of www.webshots.com and Peter Smallwood

Filters: Information reducers
A squirrel filter!
Filters: Information reducers
A molecule filter

Filters: Information reducers
A sequence filter
[Slide graphic: a long genomic DNA sequence reduced by the filter to CTCCGTAAAC CTCTAAC...]

How organism is made
How organism works

From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
Genetic code
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Rules of folding
Active site
Metabolism, Architecture, Cell interaction
Gives us:
• Custom antibiotics
• Custom antibodies
• Custom enzymes
• New materials
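The genetic-code step on these slides (ATGACTTATGATCAACGCACAGGGCTA read as Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...) is simple enough to sketch in a few lines. This is an illustrative example rather than course code: it assumes the standard genetic code, an uppercase DNA string read in the first frame, and the hypothetical function name `translate`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon.

    A trailing incomplete codon is ignored.
    """
    residues = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        residues.append(THREE_LETTER[aa])
    return "-".join(residues)


if __name__ == "__main__":
    print(translate("ATGACTTATGATCAACGCACAGGGCTA"))
```

This step is the easy part because the genetic code is a fixed lookup table; the rules of folding that follow it have no comparably simple form.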
From Sequence to Organism
How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
Genetic code
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
[Slide graphic: a long genomic DNA sequence]
3%: ATGACTTATGATCAACGCACAGGGCTA
97%: TCTACTTATATTCAATCCACAGGGCTA CACCTAGTTCTTGAAGAGTCTGTTGAA TGAACACATACATGGTTTATCTGTTTT TCTGTCTGCTCTGACCTCTGGCAGCTT TAGCCTGCCCCACTCTTAGATAAACGA ACCTTAGTGACTTCTGCTATACCAAAG TCTCCACGCCCCTCCGTAAACCTCTAA CATGATGTCAGCAAATATTAAAAATGA
?
Rules of transcriptional and post-transcriptional control
• Begin transcription
• End transcription
• Splice transcript
• Begin translation

From Sequence to Organism
How does Nature do it?
DNA
Natural filters/transformations
• Selective transcription
• Selective processing
• Translation
• Folding
Functional protein

From Sequence to Organism
How can we do it?
DNA
• Natural filters/transformations
• Simulation of Nature
• Surrogate Processes
Functional protein

From Sequence to Organism
How can we do it?
Simulation of Nature
Utterance of W Shakespeare: “Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...”
Utterance of George W Bush: “We must give our military every tool and weapon it needs to prevail...”
???

From Sequence to Organism
How can we do it?
Surrogate Processes
Utterance of W Shakespeare: “Whether ’tis nobler in the mind to suffer the slings and arrows of outrageous fortune...”
Utterance of George W Bush: “We must give our military every tool and weapon it needs to prevail...”
Word frequency, words/sentence…

From Sequence to Organism
How can we do it?
Natural filters/transformations:
• Selective transcription
• Selective processing
• Translation
• Folding/function
Surrogate filters:
• Gene finders
My sequence: TCTACTTATA CTAGTTCTTG CATACATGGT TCTGACCTCT TGGATTTCGG TTCAATCCAC AAGAGTCTGT TTATCTGTTT GGCAGCTTTC AACTCTAGCC AGGGCTACAC TGAATGAACA TTCTGTCTGC CACTAGTTTC TGCCCCACTC
Characteristics of coding sequences/introns
Predicted coding regions
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Function?
Natural filters/transformations:
• Selective transcription
• Selective processing
• Translation
• Folding/function
Surrogate filters:
• Gene finders
• Similarity finders
My predicted gene (globin?)
Sequence/motif databases
Similar genes (globin)

Surrogate Filters
Gene finders
Start/Stop codon search
Look for start codons (ATG) (GTG, TTG)
Look for stop codons (TAA, TAG, TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
Frame 1: CTC CAC GCC CCT CCG TAC ACC TCT AAC ATG ATC TCA GCA AAT ATT AAA AAT GAA TAA ACT TTG TGA CAT GTA CAA ATG GAA ATA TGC AA
Frame 2: C TCC ACG CCC CTC CGT ACA CCT CTA ACA TGA TCT CAG CAA ATA TTA AAA ATG AAT AAA CTT TGT GAC ATG TAC AAA TGG AAA TAT GCA A
Frame 3: CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA

Surrogate Filters
Gene finders
Start/Stop codon search
Look for start codons (ATG) (GTG, TTG)
Look for stop codons (TAA, TAG, TGA)
Both strands:
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
Highly inaccurate
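The start/stop codon search described above can be sketched as a six-frame scan: three reading frames on each strand, opening an ORF at the first start codon and closing it at the next in-frame stop codon. This is a minimal illustration, not the course's own implementation. The function names, the `min_codons` threshold, and the default of treating only ATG as a start are assumptions; the example sequence is the unsplit sequence as printed on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # GTG and TTG are the alternative starts noted on the slide
SLIDE_SEQ = ("CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTT"
             "GTGACATGTACAAATGGAAATATGCAA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, starts=START_CODONS, min_codons=2):
    """Scan all six frames for start ... stop stretches.

    Returns (strand, frame, start, end, orf) tuples; coordinates are
    0-based on the scanned strand, and `end` includes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in starts:
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for strand, frame, start, end, orf in find_orfs(SLIDE_SEQ):
        print(strand, frame, start, end, orf)
```

Even on this short sequence the scan reports stretches that are far too short to be real genes, which illustrates why the slide calls the method highly inaccurate: start and stop codons occur by chance, so additional evidence is needed to separate genes from noise.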
Surrogate Filters
Gene finders
Class 3: Markov Model-based recognition
Step 1: Create model through extensive training set
Training Set:
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC
AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA
CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT
GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC
TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT
ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG
TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA
TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA
TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA
TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC
AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT
GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA
CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT
GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG
TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
For each 3-mer (AAA, AAC, AAG, AAT, ACA, ..., TTG, TTT), tabulate how often each nucleotide follows it:
AAAA: 33%   AAAC: 25%   AAAG: 12%   AAAT: 30%
AACA: 30%   AACC: 20%   AACG: 15%   AACT: 35%

Surrogate Filters
Gene finders
Class 3: Markov Model-based recognition
Step 2: Assess candidate genes
3rd order Markov model
     AAA   AAC   AAG   AAT   ACA   ...   TTG   TTT
A    0.33  0.30  0.35  0.30  0.25  ...   0.25  0.30
C    0.25  0.20  0.15  0.15  0.20  ...   0.30  0.25
G    0.12  0.15  0.20  0.20  0.15  ...   0.15  0.10
T    0.30  0.35  0.30  0.25  0.35  ...   0.30  0.35
Candidate gene: AAAGCAA…
G after AAA: 0.12
C after AAG: 0.15
0.12 x 0.15 x . . .
So far, not a good candidate!

Surrogate Filters
Gene finders
Class 3: Markov Model-based recognition
Step 2: Assess candidate genes
Candidate genes, scored by the 3rd order Markov model:
• Conform to standard model: Predicted genes
• Challenge accepted beliefs

Computers are an ideal tool
Highly filtered output
• Easy to grasp
• High-level insights
Unfiltered output
• Confusing
• Basic insights

The Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
3. Need new tools for new phenomena
4. Need ability to build new tools
Need a new generation!!
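The two steps of the Markov-model gene finder described in the slides above (train on known sequences, then score a candidate) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of any particular gene-finding program: the function names are hypothetical, the add-one pseudocount and the comparison against a uniform background of 0.25 per base are choices made here for the example, and sequences are assumed to be uppercase A/C/G/T only.

```python
import math
from collections import defaultdict


def train_markov(sequences, order=3, pseudocount=1.0):
    """Step 1: estimate P(next base | previous `order` bases) from a training set."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values()) + 4 * pseudocount
        model[context] = {b: (n + pseudocount) / total
                          for b, n in next_counts.items()}
    return model


def log_odds(seq, model, order=3, background=0.25):
    """Step 2: sum of log2(P_model / P_background) over a candidate sequence.

    Contexts never seen in training fall back to the background probability.
    """
    score = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        p = probs[seq[i]] if probs else background
        score += math.log2(p / background)
    return score


# The two columns of the slide's table needed for the candidate AAAGC...
SLIDE_MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}

if __name__ == "__main__":
    # Slide arithmetic: P(G after AAA) x P(C after AAG) = 0.12 x 0.15
    print(log_odds("AAAGC", SLIDE_MODEL))
    # Training on the first three lines of the slide's training set
    model = train_markov([
        "AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC",
        "AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA",
        "CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT",
    ])
    print(log_odds("AAAGCAA", model))
```

Working in log space avoids the numerical underflow that a long product such as 0.12 x 0.15 x . . . would cause on a gene-length candidate. A negative score means the candidate looks less like the training set than like random sequence, which matches the slide's verdict of "not a good candidate".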
Future Biology
[Slide background: DNA sequence, AATAAAGCTTTACAAACCAA ACTCTGGCTTCAATTGTGTAA ...]

How to get there?
• Molecular Biology
• Computer Programming
• Statistics
Bioinformatics

How to get there?
The Challenge
• Some expert molecular biologists
• Some master programmers
• Some knowledgeable in the statistical arts
• Most have little experience with bioinformatic tools

Overall goals of the course

How to get there?
Computer programming
Goals of course
• Be able to understand well-written programs in Python
• Be able to modify working programs
• Gain increasing skill in writing programs from scratch
• Analyze problems from both biological and CS perspectives