CS273A Lecture 3: Protein coding genes
MW 12:50-2:05pm in Beckman B302
Profs: Serafim Batzoglou & Gill Bejerano
TAs: Harendra Guturu & Panos Achlioptas
http://cs273a.stanford.edu [BejeranoFall13/14]

Announcements
• http://cs273a.stanford.edu/ is up
  – Course guidelines, lecture slides, etc.
• Communications via Piazza
  – Auditors please sign up too
  – TA office hours TBA before HW1
• Project groups: TBD after “shopping season”
• Tutorials: first three Fridays
  – Recommended to bring your laptop to the UCSC tutorial on 10/4
• Lots of genomics research happening on campus
  – If you enjoy this class, many labs would love to have you!

[Slide: a full page of raw genomic DNA sequence, beginning ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAG...]

The Biggest Challenge in Genomics…
… is computational: How does this Program encode this Output?
This “coding” question has profound implications for our lives:
• Bugs: What genomic mutations predispose us to disease?
• Debugging: What genomic mutations determine our drug response?
• What in our genomes makes us different from each other?
• What in our genomes makes us different from related species?
• Why is our genome full of “memory leaks”?

Genomics will affect multiple fields of CS
• Storage
• Compression
• Architecture
• Databases
• HCI
• etc.

We need to understand the genome

[Slide: a full page of raw genomic DNA sequence, beginning ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAG...]

Central Dogma of Biology

Genomes, Genes & Proteins
• The most visible instructions in our genome are genes.
• Genes explain exactly HOW to synthesize any protein.
• Proteins are the workhorses of every living cell.
Genome: ...ACGTACGACTGACTAGCATCGACTACGACTAGCAC...
[Figure labels: gene (within the genome), protein, cell]
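
The gene-to-protein step above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the names `CODON_TABLE`, `transcribe` and `translate` are made up here, and the sketch assumes the standard genetic code, an input that is the coding strand, translation starting at the first AUG, and no splicing or UTR handling.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """DNA coding strand -> mRNA (same sequence with U in place of T)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for codons with unknown bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("GGATGGCCAAGTAAGG")
    print(mrna, translate(mrna))
```

Real genes are more involved than this (introns, alternative start sites, both strands), which is what the following slides on gene structure and processing address.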

Gene Structure [figure slide]

Gene Processing [figure slide]

Translation: The Genetic Code [figure slide]

The gene centric genome
• “The Genetic Code”: a gene-centric term, for a gene-centric world.
• There are in fact a number of additional genetic codes encoded in our genome.

Visualizing Gene Structure [figure slide]

Genes in the Human Genome (UCSC primer)
• There are ~25,000 protein coding genes in the human genome.
• (Even halfway through sequencing the human genome, researchers thought there would be well over 100,000 genes.)

Gene Finding I: ab initio
Computational challenge: “Find the genes, the whole genes, and nothing but the genes”
• Understand biology
• Write discovery tools
• (Our) answer depends on our understanding, data & tools
• CS262 (Winter)

Everything in Genomics is a Moving Target
• The genomes (i.e., assemblies)
• Their annotations
• Our understanding of biology
• The portals
Conclusion: write code that can be run... and rerun and rerun and rerun and rerun

Gene (Protein really) Functions
• The most visible instructions in our genome are genes.
• Genes explain exactly HOW to synthesize any protein.
• Proteins are the workhorses of every living cell.
Genome: ...ACGTACGACTGACTAGCATCGACTACGACTAGCAC...
Just look at the cell: lots and lots of different functions to perform. (“Only 20,000 genes”..)
[Figure labels: gene (within the genome), protein, cell]
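
As a toy version of the ab initio gene finding challenge above, the sketch below scans a DNA string for open reading frames (an ATG followed in-frame by a stop codon) on both strands. It is only the simplest possible starting point, under stated assumptions: the function names and the `min_codons` parameter are invented for this example, and the scan ignores introns entirely, so it cannot recover the split gene structure typical of human genes. Practical gene finders rely on statistical models and additional evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (start, end) for ATG...stop ORFs in the three frames of seq.

    start is the 0-based index of the A in ATG; end is exclusive and
    includes the stop codon. min_codons counts codons before the stop.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None  # close the ORF, look for the next ATG


def find_orfs(dna: str, min_codons: int = 2):
    """Return (strand, start, end) tuples for both strands.

    Coordinates for the "-" strand are relative to the reverse complement.
    """
    seq = dna.upper()
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    hits += [("-", s, e)
             for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return hits


if __name__ == "__main__":
    print(find_orfs("CCATGAAATGACC"))
```

Because genomes, annotations and tools keep changing, a scan like this is best kept as a small script that can be rerun unchanged on a new assembly.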

First full draft of the Human Genome
• Human Genome Consortium (HGC) and Celera, 2001
• Serafim discussed the current state of sequencing

Biological Functions of the Human Gene Set
Focus on the X axis. [Figure: HGC, 2001]

Molecular Functions of the Human Gene Set
[Figure: Celera, 2001]

Gene Ontologies
1. Make a controlled vocabulary of gene functions.
2. Annotate all genes using this vocabulary.
Map among genes, papers, and biological functions (plenty of room for Natural Language Processing).
Used to catalog human gene functions, and also which genes are expressed where, what defects have been found when certain genes are mutated, etc.

Genes & Their Functions
Gene (DNA) sequence determines protein (AA) sequence, which determines protein (3D) structure, which determines the protein’s function.

Protein Folding
Protein folding is the challenge of deducing protein structure from protein sequence.
New CS faculty joining in February ’14: Ron Dror

Gene Families, Gene Names
• Genes (proteins) come in families.
• Genes of the same family have similar sequences, which is why they fold into similar structures and perform similar functions.
• Genes of the same family will typically have a “family name” followed by a (sequential) number or “first name”.

Biological vs. Molecular Function: Pathways
Proteins with very different molecular functions work together to carry out a single biological function, for example a pathway.

Some “Special” Functions: Gene Regulation
2,000 different proteins can bind specific DNA sequences.
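
To make "binding a specific DNA sequence" concrete, here is a small sketch that scans a sequence for occurrences of a short motif on either strand, optionally tolerating a few mismatches. Everything in it is an assumption for illustration: the function names, the `max_mismatches` parameter and the example motif are invented, and real binding preferences are usually described with richer models than a single consensus string.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(dna: str, motif: str, max_mismatches: int = 0):
    """Return (position, strand) for each window matching the motif.

    A "-" hit means the window matches the motif's reverse complement,
    i.e. the site lies on the opposite strand. Positions are 0-based.
    """
    dna, motif = dna.upper(), motif.upper()
    rc = reverse_complement(motif)
    k = len(motif)
    hits = []
    for i in range(len(dna) - k + 1):
        window = dna[i:i + k]
        if mismatches(window, motif) <= max_mismatches:
            hits.append((i, "+"))
        elif mismatches(window, rc) <= max_mismatches:
            hits.append((i, "-"))
    return hits


if __name__ == "__main__":
    # Illustrative motif only, not taken from the lecture.
    print(find_sites("AAGATAAGCCTTATCTT", "GATAAG"))
```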
[Figure labels: proteins, DNA, protein binding site, gene]
Proteins that regulate the transcription of genes are called transcription factors.

The Importance of Gene Regulation
• The looks & capabilities of different cells are determined by the subset of genes they express.
• Different cell types express very different gene repertoires (from the same genome).
• To change its behavior, a cell can change its transcriptional program.
• Think of it as a giant state machine…

“Special” Function: Cell Signaling
Cells also talk with each other. They send and receive messages, and change their behavior according to the messages they receive.

Signal Transduction
Now it's an even bigger state machine of individual state machines (= cells) talking with each other, orchestrating their individual activities.

Alternative Splicing [figure slide]

Genes in the Human Genome
• When you only show one transcript per gene locus: [browser view]
• If you ask the GUI to show you all well established gene variants: [browser view]

Protein Domains
SKSHSEAGSAFIQTQQLHAAMADTFLEHMCRLDIDSAPITARNTG
IICTIGPASRSVETLKEMIKSGMNVARMNFSHGTHEYHAETIKNV
RTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKK
GATLKITLDNAYMAACDENILWLDYKNICKVVEVGSKVYVDDGLI
SLQVKQKGPDFLVTEVENGGFLGSKKGVNLPGAAVDLPAVSEKDI
QDLKFGVDEDVDMVFASFIRKAADVHEVRKILGEKGKNIKIISKI
ENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMIIGR
CNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIML
SGETAKGDYPLEAVRMQHLIAREAEAAMFHRKLFEELARSSSHST
DLMEAMAMGSVEASYKCLAAALIVLTESGRSAHQVARYRPRAPII
AVTRNHQTARQAHLYRGIFPVVCKDPVQEAWAEDVDLRVNLAMNV
GKAAGFFKKGDVVIVLTGWRPGSGFTNTMRVVPVP
A protein domain is a subsequence of the protein that folds independently of the other portions of the sequence, and often confers on the protein one or more specific functions.
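
The state-machine view of gene regulation described above can be sketched as a tiny synchronous Boolean network: the state is the set of genes currently expressed, and each step recomputes that set from regulatory rules. This is an assumed toy model, not something from the lecture: the genes `A`, `B`, `C`, the external input `signal`, and the rules connecting them are all hypothetical.

```python
# Hypothetical regulatory rules: each gene maps to a predicate over the set
# of currently present factors (expressed genes plus external inputs).
RULES = {
    "A": lambda on: "signal" in on,                # A is induced by an external signal
    "B": lambda on: "A" in on,                     # transcription factor A activates B
    "C": lambda on: "A" in on and "B" not in on,   # A activates C, B represses it
}


def step(state: frozenset, inputs: frozenset) -> frozenset:
    """One synchronous update: every gene reads the current state at once."""
    present = state | inputs
    return frozenset(gene for gene, rule in RULES.items() if rule(present))


def run(inputs: frozenset, steps: int) -> list:
    """Return the trajectory of states, starting with no genes expressed."""
    state = frozenset()
    trajectory = [state]
    for _ in range(steps):
        state = step(state, inputs)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    for t, state in enumerate(run(frozenset({"signal"}), 4)):
        print(t, sorted(state))
```

With the signal present, the toy cell passes through a transient state in which C is expressed before settling with A and B on; without the signal nothing is expressed. The same genome (rule set) thus yields different expression states depending on inputs and history.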

Alt. Splicing and Protein Repertoire
Alternative splicing often produces protein variants that have a different domain composition, and thus perform different functions.

Retroposed Genes and Pseudogenes
• Pseudogenes (“dead genes”): genomic sequences that resemble (and originated from) genes, but no longer make proteins.
• Retrogenes (“retrotranscribed”): protein coding RNA that was reverse transcribed and inserted back into the genome. The RNA can be grabbed at any stage (partial/full transcript, before/during/after all introns are spliced).

Review Lecture 3
• Central dogma recap
  – Focus on protein coding genes
• Gene structure
  – exon, intron, 3’/5’ UTR, CDS recap
  – The genetic code
  – UCSC genome browser sneak peek
  – human genome stats
  – Gene finding I: ab initio
• Gene (protein) function
  – Cell structure, chemical reactions etc.
  – Pathways (vs. function)
  – information processing roles
    • TFs
    • signaling: ligands, receptors, kinases
• Gene families
  – similar sequence -> structure -> function
  – protein domains
  – splice variants, alt promoters
• Special cases
  – Pseudogenes
  – Retroposed genes (and the distinction between the two)
• Gene ontologies
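
To make the reviewed link between splice variants and protein domains concrete, the sketch below enumerates exon-skipping isoforms of a made-up gene and lists the domains each isoform retains. The gene, its exon names and its domain labels are hypothetical, and the model assumes, for simplicity, that each domain is encoded entirely by one exon and that skipping an exon keeps the reading frame intact.

```python
from itertools import combinations

# Hypothetical gene: ordered exons, each tagged with the domain it encodes
# (None if the exon encodes no annotated domain).
EXONS = [
    ("e1", "signal peptide"),
    ("e2", "kinase"),
    ("e3", None),
    ("e4", "DNA-binding"),
]


def isoforms(exons, skippable):
    """Enumerate isoforms in which any subset of `skippable` exons is left out.

    Yields (kept_exon_names, retained_domains) with exon order preserved.
    """
    for r in range(len(skippable) + 1):
        for skipped in combinations(skippable, r):
            kept = [name for name, _ in exons if name not in skipped]
            domains = [
                domain for name, domain in exons
                if name not in skipped and domain is not None
            ]
            yield kept, domains


if __name__ == "__main__":
    for kept, domains in isoforms(EXONS, ["e2", "e3"]):
        print("-".join(kept), "->", domains)
```

In this toy example, skipping `e2` removes the kinase domain while skipping `e3` leaves the domain composition unchanged, so two of the four transcripts differ in function-bearing content and two do not.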