What is computational biology?
Mona Singh

Genome
• The entire hereditary information content of an organism

DNA
• String over a 4-letter alphabet: A, T, G, C
• An organism’s genome is distributed over chromosomes (e.g., 46 chromosomes in human: 22 pairs plus XY)
• Genome size: number of base pairs in an organism

Genome Sizes
Human            3 billion bps
Mouse            3 billion bps
Fruit fly        165 million bps
Nematode worm    97 million bps
Yeast            15 million bps
E. coli          5 million bps
~400 genomes sequenced

How are genomes sequenced?
• Can only sequence a few hundred base pairs at a time
• Make many copies of the DNA and cut into smaller (overlapping) pieces
• Assemble pieces: certain substrings occur in multiple fragments

Genomes to Life
Genome: ATGCCTTAC GTACCCTGC GGCAGCACT
?

• Portions of DNA code for genes, which carry the information for making proteins
• Proteins play key roles in most biological processes (e.g., signaling, catalysis, immune response, etc.)
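Returning to the sequencing slide: the "assemble pieces" step can be illustrated with a toy greedy merge that repeatedly joins the two fragments sharing the longest suffix/prefix overlap. This is only a sketch under simplifying assumptions (error-free reads, one strand, no repeats); the function names and the three example reads are made up for illustration, cut from the first two rows of the genome string shown on the "Genomes to Life" slide. It is not the method used by real assemblers.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Greedily merge the pair of fragments with the largest overlap.

    Toy heuristic (assumption): stops when no pair overlaps by min_len.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


# Hypothetical overlapping reads covering "ATGCCTTAC" + "GTACCCTGC".
reads = ["ATGCCTTACG", "TTACGTACCC", "TACCCTGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads the two 5-letter overlaps (TTACG and TACCC) are enough to recover the original 18-letter string; repeated substrings in a real genome are exactly what makes this greedy approach unreliable in practice.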
Gene Finding
gucgcuaccauuaccaguuggucuggugucaaaaauaauaau
aaccgggcaggccaugucugcccguauuucgcguaaggaaau
ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg
guuuauucuuuuucuuuuacuuuuuuaucaugggagccuacu
ucccguuuuucccgauuuggcuacaugacaucaaccauauca
gcaaaagugauacggguauuauuuuugccgcuauuucucugu
ucucgcuauuauuccaaccgcuguuuggucugcuuucugaca
aacucgggcugcgcaaauaccugcuguggauuauuaccggca
uguuagugauguuugcgccguucuuuauuuuuaucuucgggc
cacuguuacaauacaacauuuuaguaggaucgauuguuggug
guauuuaucuaggcuuuuguuuuaacgccggugcgccagcag
uagaggcauuuauugagaaagucagccgucgcaguaauuucg
aauuuggucgcgcgcggauguuuggcuguguuggcugggcgc
ugugugccucgauugucggcaucauguucaccaucaauaauc
aguuuguuuucuggcugggcucuggcugugcacucauccucg
ccguuuuacucuuuuucgccaaaacggaugcgcccucuucug
ccacgguugccaaugcgguaggugccaaccauucggcauuua
gccuuaagcuggcacuggaacuguucagacagccaaaacugu
gguuuuugucacuguauguuauuggcguuuccugcaccuacg
auguuuuugaccaacaguuugcuaauuucuuuacuucguucu
gucaggugaa...gcaaucaaugucggaugcggcgcgacgcu

MYYLKNTNFWMFGLFFFFYFFIMGAY
FPFFPIWLHDINHISKSDTGIIFAAI
SLFSLLFQPLFGLLSDKLGLRKYLLW
IITGMLVMFAPFFIFIFGPLLQYNIL
VGSIVGGIYLGFCFNAGAPAVEAFIE
KVSRRSNFEFGRARMFGCVGWALCAS
IVGIMFTINNQFVFWLGSGCALILAV
LLFFAKTDAPSSATVANAVGANHSAF
SLKLALELFRQPKLWFLSLYVIGVSC
TYDVFDQQFANFFTSFFATGEQGTRV
FGYVTTMGELLNASIMFFAPLIINRI
GGKNALLLAGTIMSVRIIGSSFATSA
LEVVILKTLHMFEVPFLLVGCFKYIT

The Genetic Code (Stryer, Biochemistry)
AUG = Methionine/start
UUA = Leucine
UUG = Leucine
UAA = Stop
UAG = Stop
UGA = Stop
. . .

Gene Finding
Reading off from the 1st start triplet:
aug ucu gcc cgu auu ucg cgu aag gaa auc cau uau gua cua uuu aaa ...
Translating (3-letter amino acid code):
Met Ser Ala Arg Ile Ser Arg Lys Glu Ile His Tyr Val Leu Phe Lys ...
(1-letter code):
M S A R I S R K E I H Y V L F K ...
Actual protein sequence:
M Y Y L K N T N F W M F G L F F ...
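The "reading off" and "translating" steps above can be sketched in a few lines. This is a minimal illustration, not a gene finder: it assumes the standard genetic code, a single strand, and no introns. The names BASES, AMINO, CODON_TABLE, FRAGMENT, start_positions and translate are introduced here for illustration only; FRAGMENT is a short excerpt copied from the RNA sequence on the Gene Finding slide.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, listed in UCAG x UCAG x UCAG
# order (UUU, UUC, UUA, UUG, UCU, ...); "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def start_positions(rna):
    """Return the index of every AUG triplet, in any reading frame."""
    rna = rna.upper()
    return [i for i in range(len(rna) - 2) if rna[i:i + 3] == "AUG"]


def translate(rna, start):
    """Translate codon by codon from `start` until a stop codon or the end."""
    rna = rna.upper()
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Excerpt of the slide's sequence that contains the first two start triplets.
FRAGMENT = ("ggccaugucugcccguauuucgcguaaggaaau"
            "ccauuauguacuauuuaaaaaacacaaacuuuuggauguucg")

starts = start_positions(FRAGMENT)
first = translate(FRAGMENT, starts[0])
second = translate(FRAGMENT, starts[1])
print(first)
print(second)
```

Translating from the first AUG reproduces the M S A R I S ... reading, while translating from the second AUG gives M Y Y L K N T N F W M F, the beginning of the actual protein sequence. Picking the right start among several candidates is part of what makes gene finding more than a table lookup.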
Computational Gene Finding Methods
• Statistical bias: protein-coding regions “look different”
  – compare coding vs. non-coding regions (Hidden Markov Models, Neural Nets)
• Sequence similarity
  – similar to known protein?

Gene finding is hard
• In some genomes, only a small portion of the genome codes for protein (needle in a haystack)
• Some genes contain introns and exons
  – exons are the parts that actually encode the protein
  – and exons can be short
• Have to get the precise boundaries to get the correct protein

Number of genes
Human            ~30,000
Mouse            ~30,000
Fruit fly        ~13,500
Nematode worm    ~19,000
Yeast            ~6,000
E. coli          ~4,000

Predicting Protein Function
MYYLKNTNFWMFGLFFFFYFFIMGAY
FPFFPIWLHDINHISKSDTGIIFAAI
SLFSLLFQPLFGLLSDKLGLRKYLLW
IITGMLVMFAPFFIFIFGPLLQYNIL
VGSIVGGIYLGFCFNAGAPAVEAFIE
KVSRRSNFEFGRARMFGCVGWALCAS
IVGIMFTINNQFVFWLGSGCALILAV
LLFFAKTDAPSSATVANAVGANHSAF
SLKLALELFRQPKLWFLSLYVIGVSC
TYDVFDQQFANFFTSFFATGEQGTRV
FGYVTTMGELLNASIMFFAPLIINRI
GGKNALLLAGTIMSVRIIGSSFATSA
LEVVILKTLHMFEVPFLLVGCFKYIT

Functions of Human Proteins
[Figure from Science, 2001; surviving label: DNA binding protein]

Sequence similarity
Ex: cystic fibrosis gene (CF) and bacterial nickel transport gene (NT)
CF: EGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLL----
NT: QAAQPLVHGVSLTLQRGRVLALVGGSGSGKSLTCAATLGILPAGVR

CF: NTEGEIQIDGVSWDSITL---------QQWRKAFGVIPQKVFIFSG
NT: QTAGEILADGKPVSPCALRGIKIATIMQNPRSAFNPL--------

CF: TFRKNLDPYEQWSDQEIWKVADEVGLRSVIEQFP-GKLDFVLVDGG
NT: ---HTMHTHARETCLALGKPADDATLTAAIEAVGLENAARVLKLYP

CF: CVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDPV
NT: FEMSGGMLQRMMIAMAVLCESPFIIADEPTTDLDVV

Database Searches
http://www.ncbi.nlm.nih.gov

Database Searches
Sequences producing significant alignments:                              E-Value
gi|5523990|gb|AAD44047.1|AF108138_1 (AF108138) DNA helicase              4e-84
gi|7511524|pir||T37310 PIF1 protein - Caenorhabditis elegans helicase    1e-77
gi|7493349|pir||T40739 rrm3-pif1 helicase homolog - fission...           3e-59
gi|11282390|pir||T47241 RRM3/PIF1 helicase homolog - fission yeast       3e-59
gi|6321820|ref|NP_011896.1| DNA helicase; Rrm3p [Saccharomyces           4e-43
gi|6323579|ref|NP_013650.1| 5' to 3' DNA helicase; Pif1p [Saccharo       1e-41
gi|558414|emb|CAA86260.1| (Z38114) len: 750, CAI: 0.14, inc...           1e-41
gi|7687929|emb|CAB89609.1| (AL354532) possible DNA helicase...           4e-41

Protein Structure
Sequence: KETAAAKFERQHMDSSTSAASSSN…
Structure: [figure]

Proteins (Lehninger, Principles of Biochemistry)
Primary:     amino acids
Secondary:   α-helix
Tertiary:    polypeptide chain
Quaternary:  assembled subunits

Protein Structure Prediction
• Physics-based methods
• Statistics-based methods

Statistics & Protein Structure Prediction
Given a new sequence and a library of folds, figure out which (if any) is a good fit to the sequence.

Secondary structure prediction
• Given a protein sequence, can you tell its secondary structure?
  – E.g., LKVVAKRELVQNNQ
          aaaa bbbb aaaaaaa
    a = alpha, b = beta
• ~70% accuracy (neural nets or other learning techniques)

Genome annotation
• Many other important features of DNA
  – E.g., proteins bind DNA regulatory elements: determines which genes are “on” when
• Statistical & comparative approaches for finding them
  – Motif finding

Universal phylogenetic tree (Woese et al.)
[Figure: tree with Prokaryotes and Eukaryotes labeled]

Building phylogenetic trees
Use DNA (or protein) sequences from various organisms, e.g.,
human    ATCGAGGC
mouse    ATCCAGCC
yeast    ATTAAGTA

Building phylogenetic trees
E.g., Distance Matrix:
         Human   Mouse   Yeast
Human    0       2       4
Mouse    2       0       4
Yeast    4       4       0
Tree:
[Figure: tree joining Human and Mouse (branch length 1 each), then Yeast; remaining branch labels: 2, 1]

Intracellular networks
[Figure: Stimulus acting on a network of Protein, DNA, and RNA]

Network of cells
[Figure: several cells, each containing Protein, DNA, and RNA, linked by functions fn]

Lecture Notes
• www.cs.princeton.edu/~mona/computational_biology_notes.html
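As a supplement to the tree-building slides, the distance matrix can be recomputed from the three example sequences by counting mismatched positions, and a tree can then be built by repeatedly joining the closest pair. The slides do not say which tree-building method was used; the sketch below assumes UPGMA (average-linkage clustering) purely for illustration, and the names hamming and upgma are hypothetical helpers, not from the lecture.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def upgma(names, dist):
    """UPGMA clustering (assumed method, not confirmed by the slides).

    dist maps frozenset({a, b}) -> distance.
    Returns (tree, root_height) where tree is a nested tuple of names.
    """
    clusters = {n: (n, 1, 0.0) for n in names}  # label -> (tree, size, height)
    d = dict(dist)
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        tree_a, size_a, _ = clusters.pop(a)
        tree_b, size_b, _ = clusters.pop(b)
        new = f"({a},{b})"
        for c in clusters:
            # Size-weighted average distance from the merged cluster to c.
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[new] = ((tree_a, tree_b), size_a + size_b, height)
    tree, _, height = next(iter(clusters.values()))
    return tree, height


seqs = {"human": "ATCGAGGC", "mouse": "ATCCAGCC", "yeast": "ATTAAGTA"}
dist = {frozenset((a, b)): hamming(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)}
tree, root_height = upgma(list(seqs), dist)
print(dist)
print(tree, root_height)
```

The mismatch counts reproduce the 2, 4, 4 entries of the distance matrix. Under the UPGMA assumption, human and mouse join at height 1 and yeast joins at height 2, which is at least consistent with the branch labels 1, 1, 2, 1 that survive from the slide's tree figure.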