Welcome to BNFO 601
Integrated Bioinformatics
The villains gallery:
Paul Fawcett
Jeff Elhai
Welcome to BNFO 601
Course Organization
www.vcu.edu/csbc/bnfo601
Scientific problems, bioinformatic solutions!
For each topic:
•Lecture and web supplements
•Discussion and computer time
•Problem sets
Focus on principles, not soon-to-be obsolete software!!
Optional Textbook BNFO 601
“The Quick Python Book”, 2nd edition
by Vernon L. Ceder.
Manning Publications Co.
If purchased new, also comes with e-book.
What is bioinformatics, anyway?
One reasonable definition:
The study and application of
computational and statistical
methods for the management and
analysis of biological information.
This is a very broad definition - therefore
there are many “flavours” of bioinformatics
Why do we need bioinformatics?
Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
Available biological data is growing exponentially!
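One way to put a number on "exponentially" is to estimate a doubling time from the first and last rows of the table. The sketch below assumes the growth follows a single exponential over the whole period, which is a simplification; the function name `doubling_time` is illustrative and not part of the course materials.

```python
import math

# (year, base pairs) taken from the 1982 and 2002 rows of the table
first_year, first_bp = 1982, 680338
last_year, last_bp = 2002, 28507990166

def doubling_time(y0, n0, y1, n1):
    """Years needed to double, assuming N(t) = N0 * 2**(t / T)."""
    doublings = math.log2(n1 / n0)   # how many times the data doubled
    return (y1 - y0) / doublings     # years per doubling

T = doubling_time(first_year, first_bp, last_year, last_bp)
print(f"Estimated doubling time: {T:.2f} years")
```

Under that assumption the database doubled roughly every year and a quarter across these two decades.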
Why do we need bioinformatics?
Humans have a hard time finding patterns in large or complex data sets!
MDVEEFLSRVDAGELVISLGDLSGAILSEVDLSGINLSG
ANLSGLWKNLSTILSNTLWDIKEADALATIREIQDESNRA
HALIALADKISLPPDLLSEALTVARVDEADCADALIALARKL
PPDLLSEALATAAEREIQDEYFRTSTLIELKLPSVLSEALAAAREI
QDEYFRASTLIADEYLAEKLPSVLSEALAASREIQFRADALRELAQ
KLPPDLLSEALAAVREIQPEYLRADALIALVEKLPSVLSEALAAIREIQ
DEYLHADALRELVQKLPPDLLGEVLAAATEIRGGYPHTNPLRELAEKLPP
DLLSEALAAAREIQDESNRAHALRELAEKLPPDLLSEALTATREIQSEYHRAST
Biological data can be confusing!
EIQPKSNRADALIALAEKLPPDLLSEALAAIREIQDESNRAHALIALAEKLPPDLLSEALA
AIREIQDESNRAHALIALAQKLPPDLLSEALAATREIQSKSNRVHALIALAQKLPSVLPEA
LAAATEIQDESNRASTLRELAEKLPPDLLSEALAAIREIQPKSNRVHALIALAQKLPSVLPEAL
LRALAQKLPPDLLSEALAAAREIQDESNRASTLRELAEKLPSVLPEALAAVRKIRH
KSNRAYGLIALAEKLPSVLPEALAAATEIEPEYHRASTLRELAEKLPPDLLSELTAIS
AAIREIHHEYHRDNALRELAEKLPPNLLSEALAVIREIHYESNRTNALIALAKKLPSVLPEALA
AVRKIRDKSNRIYALRELADKLPSVLPEALATAREIHDESYRADALKELAEKLPPDLLSEALTAIREIH
DESYRADALIALAEKLPSVLPEALAAATVIRPESYRADALRDLAQKLPPDLLSEALAAIREIQSESNR
AHALIALAEKMSLHNPSLSNVSANCVNLNHSTLTEAKLNQSDLRYGNLKGANLNKANLSRAFLNHADLSN
TMLAQSNLSGTNLRNANLRNANLIMREEIRKVNQSLGESPKFGPFTGRQFVIFAGIFCIVFGLLCLIIGLDIFW
GLGFAFWSSFSVALLSGDQPYIYWSKVYPIVPRWTRGYATYTSPHLKKKVGTRKVKLTRSSKPKTLNPFEDWLDLTT
IVRLKKDAYTVGAYLLSKKNLTDSNNTLQLIFGFSCTGIHPLFNSEQEIEAVAKIFESGCKEIPPGEKI TFRWSSFCDDSD
AEQYLMQRINNSSSLECEFLDWGRLARTQKLTNQRARKDIKLNIYWSFTVSSEALETSDPVDKFLAKLANFVQRRFTDSGVNQL
TKKRFTQILTKALEASLRYQQILTEMGLNPQPKTDKDLWQELCKNIGAKTVIAPHTLVFDEQGVREEIDEKAVFDKP IEIINQPHLSSIILNNG
VPFADKRWICLPTGENKKFVGVMVLTRKPEIFASTKHQIRFLWDLFSRNNIFDVEIITEFSPADRGITRAAQQMITKRSRALDLNVQQKKSIDVSAQINVE
RSVEAQRQLYTGDVPLNLSLVVLVYRDTPEEIDDACRLISGYISQPTELTREVEYAWLIWLQTLLIRLEPILLRPYNRRLTFFASEIL GLTNIVQNSPADEQGF
ELIADESDSPLHLDLSKTKNILILGTTGSGKSVLVSSIIGECQAQDMSVLMIDLPNDDGTGTFGDYTPYHNGFYFDISKESNNLVQPLDLSKIPPDE WEDRLQAHRNDVNLIVLQ
LVLGSQTFDGFLSQTIESLIPLGTKAFYDHADIQRRFAKAKKDGLGSAAWDDTPTLADMERFFSKEHISLGYEDENVDRALNYIRLRFQYWRNSSIGNAICRPSTFDTDAKLITFALTNLQSSKDA
EVFGMSAYIAASRQSLSAPNSVFFMDEASVLLRFAALSRLVGRKCATARKGGCRVMLAAQDILSIANSEAGEQILQNMPCRLIGRIVPGAAKSFTEHLGIPKDIIDKNESFRPNIKQLYTLWLLDY
NN
But is rich in information content!
Where did bioinformatics come from?
Evolved from, but is distinct from,
the intellectual traditions of:
•Genetics
•Biochemistry
•Molecular Biology
•Computer Science
•Probability & Statistics
•Genomics
Pre-genomic Molecular Biology
The cell as a factory
Pre-genomic Molecular Biology
The cell as a “Black box”
Pre-genomic Molecular Biology
How do we figure out how cars are made?
Genetic approach
Biochemical approach
Pre-genomic Molecular Biology
Biochemist’s Approach
An inherently reductionist approach!
Pre-genomic Molecular Biology
Geneticist’s Approach
Isolation of a Defective Gene
Pre-genomic Molecular Biology
How we viewed the world
• One component at a time
• Highly filtered perception
• Many local viewpoints
• Subject to ascertainment bias
Post-genomic Molecular Biology
A major goal is to achieve a synoptic,
integrated understanding of cell function
Post-genomic Molecular Biology
Bioinformaticist’s Approach
(short term)
Identify critical parts
Post-genomic Molecular Biology
Bioinformaticist’s Approach
(long term)
Assemble the whole
Genomics
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
What is Bioinformatics?
TGAGACACATATTTTTGATATTCCAGTTGTTGCAATC
GAATGTAAAACATATTTAGATCTTTAAATGTATGGTAC
ATTCAAGATCCAACCTTCATTCTAGTGTTTAAAGAGAAC
TGATTTGTTTGCAGGGGCAGGAGGCTTTGGTTTAGGTTTTG
AAATGGCAGGCTTCTCTGTACCTTTATCTGTTGAAATTGATACCT
GGGCTTGTGATACACTACGCTACAACCGCCCTGATTCAACAGTTATT
CAAAATGATATCGGTAACTTTAGTACAGAAAATGACGTTAAGAATATCT
GCAACTTTAAACCTGATATTATTATTGGCGGGCCTCCATGCCAGGGATTTAG
TATTGCTGGGCCAGCCCAAAAAGATCCTAAAGATCCTAGAAATGG
What is genomic data?
AATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAATCCTCATAAAACT
TTTATTCATCAACTTTGCACAATGGATAAAATTTCTTGAACCTAAAGCGTTTGTC
ATG
GAAAACGTAAAAGGATTGCTATCAAGGAAAAATGCAGAAGGTTTTAAAGTTATAGATA
CTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCA
AGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCT
TTATTAAGAAAACATTTGAAGAACTTGGTTATTTTGTCGAAGTATGGGTTTTAAATGCTGCGGAAT
ATGGCATTCCGCAAATTAGAGAACGTATTTTTATTGTTGGCAATAAAAAAGGTAAAGTACTAGGTATGAG
TATTATACCTGCACTAACTTTGTGGGACGCAATATCAGACTTACCAGAACTTAATGCGCGTGAAGGAAGT
GAAGAGCAACCCTATCATTTAAAACCTCAAAATACTTATCAGACTTGGGCTAGAAATGGTAGTGCTACGCTTTAC
AATCATGTTGCAATGGAACATTCTGACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGAATCCAGTTCGG
ATGTATCTAAAGAACATGGAGCTAGACGACGTAGTGGTAATGGTGAATTATCAAACAAATCATATGATCAGAATAATCGCCG
TTTAAATCCTCATAAACCGTCTCACACTATTGCTGCGTCATTCTATGCTAATTTTGTCCATCCTTTTCAACATC GAAATTTAACAGCCC
GTGAAGGAGCTAGAATCCAATCTTTTCCAGATAACTATAGATTTTTTGGAAAAAAAACTGTCGTATCTCATAAACTATTGCATCGAGAAGAAAGATT
TGATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTAAAGTAATTGCACATCATCTTCTAGAGAAATTAGAGTTATGCCAACAA
CTGATAGAAATCCTCTAGTGCATGGATCAAATCTTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAA TCAGAACTGAATATGACAAATG
GCATAAAGCAAATATGAACCTGGTTGGACCAAAATCAGAAATTACTGACCAAGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTG
ATTCAAGATCCAACCTTCATTCTAGTGTTTTAGAGACCATTTATAAAGTAAATCTTTAGACGACTAGACGACGTAGCATAATACGAGTCATAACGGCATATATGGCAGCCTCACTCATTTCTGGGAGACGCTCATAAT
CCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAATCATCAGCCAAACAGAGAGCGCAAATTTATCACCGTCATAGCCGGAATCAACCCAGATGACTTCAACTTTTTCCAGTAATTC
TGGACGCTCTTCTAACAGTTCCATCAAAGTATAGGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACAT
CCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCCGCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGA ACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAC AACCTGTTTTCAGATGGTAGTAGATAGCGTT
GCATACTTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTA
CTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGATTTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGT
GCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACCTTGTGGTTATGGAATGGATAAAACTGACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT
Partial Hierarchy of Genomic Data
• DNA Sequences
• Contigs of assembled sequences
• Predicted introns, exons, promoters, etc.
• Genes
• RNA sequences
• Predicted gene products, proteins
• Chromosomes
• Genome
Sequence Analysis is therefore a
fundamental component of bioinformatics!
E. coli: What makes it kill?
Escherichia coli . . .
. . very small lab rats
Courtesy of Kent State University Microbiology
E. coli: What makes it kill?
Escherichia coli . . .
haemorrhagic colitis
E. coli: What makes it kill?
E. coli K12
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
Gene finder
E. coli O157:H7
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
Gene finder
E. coli: What makes it kill?
Killer protein
Killer functions
Membrane protein, sodium transporter
Iron responsive transcriptional regulator
Calcium-dependent protein kinase
Unknown protein
Unknown protein
Similarity finder
Unknown protein
etc . . .
ideas for new antibiotics
Metabolomics
Genomics
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
What is Bioinformatics?
Towards a Treatment for Sleeping Sickness
Causative agent
Trypanosoma brucei
Prevalence
~66 million people at risk
T. brucei swimming in blood
Standard treatment
Derivative of arsenic
Photo courtesy of Center for Disease Control Image Library
Towards a Treatment for Sleeping Sickness
Trypanosomes
Dependent on glycolysis
Humans
Dependent on glycolysis
OR
oxidative metabolism
IDEA: Identify drug that selectively blocks glycolysis
Towards a Treatment for Sleeping Sickness
How to block glycolysis?
• ~dozen enzyme targets
• ~$1 billion per target
• effective on enzyme = effective on trypanosomes
Need a method to predict effectiveness!!
Towards a Treatment for Sleeping Sickness
Glucose
ATP
Glucose-6-phosphate
Hexokinase
d(G6P)/dt = k3[glucose][ATP]
  d(G6P)/dt: rate of increase of metabolite
  [glucose], [ATP]: concentrations of metabolites
  k3: rate constant (property of enzyme)
ADP
Towards a Treatment for Sleeping Sickness
Glucose
ATP
Glucose-6-phosphate
Hexokinase
d(G6P)/dt = k3[glucose][ATP]
d(F6P)/dt = k4[G6P]
d(FDP)/dt = k6[F6P][ATP]
.
.
.
ADP
Model of glycolysis
d(pyruvate)/dt = k20[PEP][ADP]
[Plot: model output vs. time; x-axis time 0 to 20, y-axis 0 to 12]
Towards a Treatment for Sleeping Sickness
Run model with
different realities
Which target enzyme best for ATP?
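Rate equations like those on the glycolysis slides can be stepped forward in time numerically, and "different realities" then amount to different rate constants. The following is only a toy sketch: it covers three reactions, holds glucose and ATP constant, adds a consumption term to each metabolite (the slides show only the production terms), and uses made-up values for k3, k4 and k6. None of these numbers come from the lecture.

```python
def simulate(k3=0.1, k4=0.5, k6=0.2, glucose=5.0, atp=2.0,
             dt=0.01, t_end=20.0):
    """Integrate a toy three-step pathway with the forward Euler method.

    glucose and ATP are held constant (an assumption for illustration).
    Returns the final concentrations (G6P, F6P, FDP).
    """
    g6p = f6p = fdp = 0.0
    for _ in range(int(round(t_end / dt))):
        v_hexokinase = k3 * glucose * atp   # glucose + ATP -> G6P
        v_isomerase = k4 * g6p              # G6P -> F6P
        v_pfk = k6 * f6p * atp              # F6P + ATP -> FDP
        g6p += dt * (v_hexokinase - v_isomerase)
        f6p += dt * (v_isomerase - v_pfk)
        fdp += dt * v_pfk
    return g6p, f6p, fdp

baseline = simulate()
# Hypothetical drug that cuts the third rate constant tenfold
inhibited = simulate(k6=0.02)
print("baseline  G6P, F6P, FDP:", baseline)
print("inhibited G6P, F6P, FDP:", inhibited)
```

Comparing runs with one rate constant lowered at a time is a cheap way to ask which enzyme, if blocked, would most reduce flux through the pathway before committing to any one target.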
Structure
Metabolomics
Genomics
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
What is bioinformatics?
Proteomics
Systems
Transcriptomics
What is bioinformatics, revisited
How to extract biological
meaning from overwhelming
information
A Walk in the Forest
* Photo courtesy of www.webshots.com
Observation
* Photos courtesy of www.webshots.com and Peter Smallwood
Experiment
* Photos courtesy of www.webshots.com and Peter Smallwood
Filters: Information reducers
A squirrel filter!
Filters: Information reducers
A molecule filter
Filters: Information reducers
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TATGAGGCAA
CTCGGGAGCG
CCTTTAGATG
AGGCCGGAGG
CCCCGGCCTA
TTCCCTGGGC
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
TCACAGCATC
CACGGCTCTA
CAAGAAGGAG
GTCAAGAACT
AGGCTGCCTG
TCGGCGGGAC
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
AGGTGACCTT
AAGAGGCCCA
GAAACAGCTC
CTCCACCGGC
TGCTATAAAT
AGATAACATG
CTAGTTCTTG
TTATCTGTTT
CACTAGTTTC
TTAGATAAAC
CTCCACGCCC
ATATTAAAAA
AATTAGCAAA
CATTCTAGGG
AAACAAGCTA
ATTTCCTGGG
AGCCAAGGAC
TGACAGACAG
ATTGAACCCT
AGTGCAGACA
AGAAATGAGA
AGTATCTATT
TATCCAGGCA
GAAATCCCTG
GGCAGCGGCC
ACGCGGCCCA
AATGTGCCCT
A sequence filter
CTCCGTAAAC
CTCTAAC...
How organism is made
How organism works
From Sequence to Organism
How does Nature do it?
Active site
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of folding
Metabolism,
Architecture
Cell interaction
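The "Genetic code" step on this slide is simple enough to write down directly. The sketch below builds the standard codon table from the conventional T, C, A, G ordering and translates the slide's DNA sequence; the names `CODON_TABLE`, `THREE_LETTER` and `translate` are illustrative choices, not course code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}

def translate(dna):
    """Translate DNA read in frame from its first base, three bases at a time."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return "-".join(THREE_LETTER[CODON_TABLE[codon]] for codon in codons)

print(translate("ATGACTTATGATCAACGCACAGGGCTA"))
# Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu
```

The "Rules of folding" step has no comparably short program, which is part of the point of the slides that follow.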
From Sequence to Organism
How does Nature do it?
Active site
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of folding
Gives us:
• Custom antibiotics
• Custom antibodies
• Custom enzymes
• New materials
From Sequence to Organism
How does Nature do it?
?
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
CTAGTTCTTG
TTATCTGTTT
CACTAGTTTC
TTAGATAAAC
CTCCACGCCC
ATATTAAAAA
AATTAGCAAA
CATTCTAGGG
AAACAAGCTA
ATTTCCTGGG
AGCCAAGGAC
TGACAGACAG
ATTGAACCCT
AGTGCAGACA
AGAAATGAGA
3%
ATGACTTATGATCAACGCACAGGGCTA
Rules of transcriptional and
post-transcriptional control
• Begin transcription
• End transcription
• Splice transcript
• Begin translation
From Sequence to Organism
How does Nature do it?
?
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
CTAGTTCTTG
TTATCTGTTT
CACTAGTTTC
TTAGATAAAC
CTCCACGCCC
ATATTAAAAA
AATTAGCAAA
CATTCTAGGG
AAACAAGCTA
ATTTCCTGGG
AGCCAAGGAC
TGACAGACAG
ATTGAACCCT
AGTGCAGACA
AGAAATGAGA
97%
TCTACTTATATTCAATCCACAGGGCTA
CACCTAGTTCTTGAAGAGTCTGTTGAA
TGAACACATACATGGTTTATCTGTTTT
TCTGTCTGCTCTGACCTCTGGCAGCTT
TAGCCTGCCCCACTCTTAGATAAACGA
ACCTTAGTGACTTCTGCTATACCAAAG
TCTCCACGCCCCTCCGTAAACCTCTAA
CATGATGTCAGCAAATATTAAAAATGA
Rules of transcriptional and
post-transcriptional control
• Begin transcription
• End transcription
• Splice transcript
• Begin translation
From Sequence to Organism
How does Nature do it?
Natural filters/transformations
DNA
• Selective transcription
• Selective processing
• Translation
• Folding
Functional protein
From Sequence to Organism
How can we do it?
Natural filters/transformations
DNA
Simulation of Nature
Functional
protein
Surrogate Processes
From Sequence to Organism
How can we do it?
Simulation of Nature
“Whether ‘tis nobler in the mind
to suffer the slings and arrows
of outrageous fortune...”
Utterance of
W Shakespeare
Utterance of
George W Bush
“We must give our military
every tool and weapon
it needs to prevail...”
???
From Sequence to Organism
How can we do it?
Surrogate Processes
“Whether ‘tis nobler in the mind
to suffer the slings and arrows
of outrageous fortune...”
“We must give our military
every tool and weapon
it needs to prevail...”
Word frequency, words/sentence…
Utterance of
W Shakespeare
Utterance of
George W Bush
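The surrogate idea can be tried on the two quotations themselves: rather than simulating the speaker, measure simple statistics of the text. The sketch below computes word counts and words per sentence. The function name `text_features` is illustrative, and each quotation is closed with a period here (the slides truncate them with "...") so that it counts as one sentence.

```python
import re
from collections import Counter

def text_features(text):
    """Return (word counts, mean words per sentence) for a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words), len(words) / len(sentences)

shakespeare = ("Whether 'tis nobler in the mind to suffer "
               "the slings and arrows of outrageous fortune.")
bush = ("We must give our military every tool and weapon "
        "it needs to prevail.")

for speaker, text in (("Shakespeare", shakespeare), ("Bush", bush)):
    counts, words_per_sentence = text_features(text)
    print(speaker, counts.most_common(3), words_per_sentence)
```

With only one sentence each these numbers mean little; the approach becomes useful when the statistics are gathered over large samples of known authorship and then compared with an unknown text.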
From Sequence to Organism
How can we do it?
Natural filters/transformations
Surrogate filters
• Selective transcription
• Gene finders
• Selective processing
• Translation
• Folding/function
TCTACTTATA
CTAGTTCTTG
CATACATGGT
TCTGACCTCT
TGGATTTCGG
TTCAATCCAC
AAGAGTCTGT
TTATCTGTTT
GGCAGCTTTC
AACTCTAGCC
AGGGCTACAC
TGAATGAACA
TTCTGTCTGC
CACTAGTTTC
TGCCCCACTC
Predicted coding regions
My sequence
Characteristics of
coding sequences/introns
From Sequence to Organism
How can we do it?
Natural filters/transformations
Surrogate filters
• Selective transcription
• Gene finders
• Selective processing
• Translation
• Folding/function
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Function?
From Sequence to Organism
How can we do it?
Natural filters/transformations
Surrogate filters
• Selective transcription
• Gene finders
• Selective processing
• Similarity finders
• Translation
• Folding/function
globin?
globin
Sequence/motif
databases
My predicted gene
Similar genes
Surrogate Filters
Gene finders
Start/Stop codon search
Look for start codons (ATG) (GTG,TTG)
Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATCTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
CTC CAC GCC CCT CCG TAC ACC TCT AAC ATG ATC TCA GCA AAT ATT AAA AAT GAA TAA ACT TTG TGA CAT GTA CAA ATG GAA ATA TGC AA
C TCC ACG CCC CTC CGT ACA CCT CTA ACA TGA TCT CAG CAA ATA TTA AAA ATG AAT AAA CTT TGT GAC ATG TAC AAA TGG AAA TAT GCA A
CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA
Surrogate Filters
Gene finders
Start/Stop codon search
Look for start codons (ATG) (GTG,TTG)
Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATCTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
Highly inaccurate
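The start/stop codon search can be sketched in a few lines: read each of the six frames (three per strand) codon by codon, open a candidate at a start codon and close it at the next in-frame stop. The sequence is the one from the slide, taken as it reads in the three-frame breakdown; the function names and the tuple layout of the results are illustrative assumptions.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, starts=STARTS, stops=STOPS):
    """List (strand, frame, begin, end, orf) for every start...stop run."""
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            begin = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if begin is None and codon in starts:
                    begin = i                      # open a candidate gene
                elif begin is not None and codon in stops:
                    orfs.append((strand, frame, begin, i + 3, s[begin:i + 3]))
                    begin = None                   # close it at the stop
    return orfs

seq = ("CTCCACGCCCCTCCGTACACCTCTAACATGATCTCAGCAAATATTAAAAATGAATAAACTTTG"
       "TGACATGTACAAATGGAAATATGCAA")

for orf in find_orfs(seq, starts={"ATG"}):
    print(orf)
```

Even this short sequence yields several candidates, and random DNA contains start and stop codons by chance, which is one reason the method alone is highly inaccurate.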
Surrogate Filters
Gene finders
Markov Model based recognition
Step 1: Create model through extensive training set
Training Set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC
AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA
CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT
GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC
TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT
ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG
TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA
TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA
TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA
TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC
AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT
GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA
CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT
GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG
TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
AAA
AAC
AAG
AAT
ACA
...
TTG
TTT
Surrogate Filters
Gene finders
Class 3: Markov Model - based recognition
Step 1: Create model through extensive training set
How often each nucleotide follows AAA in the training set:
AAAA: 33%
AAAC: 25%
AAAG: 12%
AAAT: 30%
Surrogate Filters
Gene finders
Class 3: Markov Model - based recognition
Step 1: Create model through extensive training set
How often each nucleotide follows AAC in the training set:
AACA: 30%
AACC: 20%
AACG: 15%
AACT: 35%
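Step 1 amounts to counting. For every position in the training sequences, record which nucleotide follows the preceding three, then turn the counts into fractions. The sketch below shows that bookkeeping on a tiny made-up training set; `train_markov` and the two toy sequences are illustrative and not taken from the slides.

```python
from collections import Counter, defaultdict

def train_markov(sequences, order=3):
    """Estimate P(next base | preceding `order` bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            context = seq[i:i + order]          # e.g. "AAA"
            counts[context][seq[i + order]] += 1
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        model[context] = {base: following[base] / total for base in "ACGT"}
    return model

# Toy training set (hypothetical); a real one would hold many known genes
model = train_markov(["AAAAC", "AAAG"])
print(model["AAA"])
```

A real training set needs to be large because a 3rd order model has 64 contexts with four outcomes each, and every one of those 256 frequencies has to be estimated from enough examples.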
Surrogate Filters
Gene finders
Class 3: Markov Model - based recognition
Step 2: Assess candidate genes
3rd order Markov model
Preceding 3 bases   Probability of next base
                    A      C      G      T
AAA                 0.33   0.25   0.12   0.30
AAC                 0.30   0.20   0.15   0.35
AAG                 0.35   0.15   0.20   0.30
AAT                 0.30   0.15   0.20   0.25
ACA                 0.25   0.20   0.15   0.35
...
TTG                 0.25   0.30   0.15   0.30
TTT                 0.30   0.25   0.10   0.35
Candidate gene: AAAGCAA…
Score: 0.12 (G after AAA) x 0.15 (C after AAG) . . .
So far, not a good candidate!
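The running product on this slide can be reproduced directly from the table. The sketch below uses only the rows needed for the first five bases of the candidate, because the remaining rows are elided on the slide. Comparing against a uniform background of 0.25 per base is an added assumption, included only to give "not a good candidate" a concrete meaning.

```python
import math

# Rows copied from the slide's table: P(next base | preceding three bases)
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAC": {"A": 0.30, "C": 0.20, "G": 0.15, "T": 0.35},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}

def sequence_probability(seq, model, order=3):
    """Multiply the model's probability for each base given the bases before it."""
    prob = 1.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        prob *= model[context][seq[i]]
    return prob

candidate = "AAAGC"                   # first five bases of the slide's candidate
p = sequence_probability(candidate, MODEL)
steps = len(candidate) - 3            # number of scored transitions
# Assumed background: every base equally likely (0.25) at each step
log_odds = math.log2(p / 0.25 ** steps)
print(p, log_odds)
```

A negative log-odds score means the candidate looks less like the training genes than random sequence would, which matches the slide's verdict so far. Real gene finders work with sums of logarithms rather than raw products, since products over thousands of bases underflow.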
Surrogate Filters
Gene finders
Class 3: Markov Model - based recognition
Step 2: Assess candidate genes
3rd order Markov model
Challenge accepted beliefs
Candidate genes
Conform to standard model
Predicted genes
Computers are an ideal tool
Highly filtered output
• Easy to grasp
• High-level insights
Unfiltered output
• Confusing
• Basic insights
The Crisis in Bioinformatics
1. Need high-level filters
2. Need access to raw phenomena
3. Need new tools for new phenomena
4. Need ability to build new tools
Need a new generation!!
Future Biology
AATAAAGCTTTACAAACCAA
ACTCTGGCTTCAATTGTGTAA
CCCAAGCTTTGATTCTTTCCT
CTGTTAAATCGGATTGATTAT
CTTCATCAAGGGCAAGACCT
ACAAATTTACCATCACGAAC
AGCTTTAGACTCACTGAATT
CATAACCTTCTGTAGGCCAA
TAGCCAACTGTTTCACCACC
How to get there?
Molecular Biology
Computer Programming
Bioinformatics
Statistics
How to get there?
The Challenge
• Some expert molecular biologists
• Some master programmers
• Some knowledgeable in the statistical arts
• Most have little experience with bioinformatic tools
Overall goals of the course
How to get there?
Computer programming
Goals of course
• Be able to understand well-written programs in Python
• Be able to modify working programs
• Gain increasing skill in writing programs from scratch
• Analyze problems from both biological and CS perspectives