CS273A
Lecture 3: Protein coding genes
MW 12:50-2:05pm in Beckman B302
Profs: Serafim Batzoglou & Gill Bejerano
TAs: Harendra Guturu & Panos Achlioptas
http://cs273a.stanford.edu [BejeranoFall13/14]
Announcements
• http://cs273a.stanford.edu/ is up
– Course guidelines, lecture slides, etc.
• Communications via Piazza
– Auditors please sign up too
– TA Office hours TBA before HW1
• Project groups: TBD after “shopping season”
• Tutorials: First three Fridays
– Recommended to bring your laptop to UCSC tutorial 10/4
• Lots of genomics research happening on campus
– If you enjoy this class many labs would love to have you!
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA
TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC
TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC
TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT
CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG
AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT
TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG
TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT
TGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT
TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC
ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA
GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA
ATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA
TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA
ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT
ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAG...TTGCGAA
TCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAA
TTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGA
CCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT
TTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACAT
AAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAA
AGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAAT
AGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTAC
CCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATAT
ACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCG
GGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTC
CTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATT
TGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGT
TTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG
TTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATA
TATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATG
TTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTA
AGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGA
ATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATA
TCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATG
TCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACT
ATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGAT
The Biggest Challenge in Genomics…
… is computational:
How does this
Program
encode this
Output
This “coding” question has profound implications for our lives
The Biggest Challenge in Genomics…
… is computational:
How does this
Program
encode this
Bugs
Output
What genomic mutations predispose us to disease?
The Biggest Challenge in Genomics…
… is computational:
How does this
Program
encode this
Bugs
Debugging
What genomic mutations determine our drug response?
The Biggest Challenge in Genomics…
… is computational:
How does this
Program
encode this
Output
What in our genomes makes us different from each other?
The Biggest Challenge in Genomics…
… is computational:
How does this
Program
encode this
Output
What in our genomes makes us different from related species?
The Biggest Challenge in Genomics…
… is computational:
How does this
Program
encode this
Output
Why is our genome full of “memory leaks”?
Genomics will affect multiple fields of CS
Storage
Compression
Architecture
Databases
HCI
etc. etc.
We need to understand the genome
[The genomic DNA sequence shown at the start of the lecture is displayed again.]
Central Dogma of Biology
Genomes, Genes & Proteins
The most visible instructions in our genome are Genes.
Genes explain exactly HOW to synthesize any protein.
Proteins are the workhorses of every living cell.
gene
Genome:
...ACGTACGACTGACTAGCATCGACTACGACTAGCAC...
protein
cell
Gene Structure
Gene Processing
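The splicing step of gene processing can be sketched in a few lines. This is a minimal illustration, assuming exons are given as 0-based, half-open (start, end) intervals on the unspliced transcript; the function name `splice`, the toy sequence and the coordinates are made up for this example.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate the exon intervals in order, dropping the introns between them."""
    ordered = sorted(exons)
    # Overlapping exons would duplicate bases, so reject them.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("exon intervals overlap")
    return "".join(pre_mrna[start:end] for start, end in ordered)


# Toy transcript: exons in upper case, introns in lower case (made-up sequence).
pre_mrna = "ATGGCC" + "gtaagtttcag" + "AAATTT" + "gtgagtccag" + "GGGTAA"
exons = [(0, 6), (17, 23), (33, 39)]
mature = splice(pre_mrna, exons)
print(mature)  # ATGGCCAAATTTGGGTAA
```

Real gene processing also includes capping and polyadenylation, which this sketch ignores.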
Translation: The Genetic Code
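As a rough sketch of what translation does, the standard genetic code can be written as a lookup table from codons to one-letter amino acid codes. The names `CODON_TABLE` and `translate` are chosen for this example, and the input fragment is copied from the DNA sequence shown near the start of the lecture. The sketch assumes the input is a coding sequence read from its first base on the forward strand.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognized codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGACTAAATCTCATTCAGAAGAAGTG"))  # MTKSHSEEV
```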
The gene centric genome
“The Genetic code”
A gene centric term.
For a gene centric world.
There are in fact a number
of additional genetic codes
encoded in our genome.
Visualizing Gene Structure
Genes in the Human Genome
UCSC primer
There are ~25,000 protein coding genes in the human genome.
(Even halfway through sequencing the human genome,
researchers thought there would be well over 100,000 genes.)
Gene Finding I: ab initio
Computational Challenge:
“Find the genes, the whole genes, and nothing but the genes”
CS262
Winter
Understand Biology

Write discovery tools
(Our) answer depends on our understanding, data & tools
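To give a flavor of the ab initio idea, here is a deliberately naive open reading frame (ORF) scanner. It is a simplified stand-in, not the method taught in the course: real ab initio gene finders use probabilistic models and handle introns, both strands and much more. The function name `find_orfs` and the `min_codons` threshold are assumptions made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) intervals, 0-based and half-open, of forward-strand ORFs.

    Each ORF starts at an ATG and ends at the first in-frame stop codon.
    min_codons counts every codon in the ORF, including the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


toy = "CCATGAAATAGCC"  # made-up sequence with one short ORF in frame 2
print(find_orfs(toy))  # [(2, 11)]
```

Because annotations and assemblies keep changing, a scanner like this is best kept as a function that can be rerun on new input rather than a one-off script.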
Everything in Genomics is a Moving Target
• The genomes (i.e., assemblies)
• Their annotations
• Our understanding of Biology
• The portals
Conclusion:
write code
that can be
run...
and rerun
and rerun
and rerun
and rerun
Gene (Protein really) Functions
The most visible instructions in our genome are Genes.
Genes explain exactly HOW to synthesize any protein.
Proteins are the workhorses of every living cell.
gene
Genome:
...ACGTACGACTGACTAGCATCGACTACGACTAGCAC...
Just look at the cell.
Lots and lots of different
functions to perform.
(“Only 20,000 genes”..)
protein
cell
First full draft of the Human Genome
Human Genome Consortium
(HGC)
Celera
2001
Serafim discussed the current state of sequencing
Biological Functions of the Human Gene Set
Focus on
the X axis:
[HGC, 2001]
Molecular Functions of the Human Gene Set
[Celera, 2001]
Gene Ontologies
1. Make a controlled vocabulary of gene functions.
2. Annotate all genes using this vocabulary.
Map: genes -> papers -> biological functions.
(plenty of room for Natural Language Processing)
Used to catalog human gene functions, and also
which genes are expressed where,
what defects have been found when
certain genes are mutated, etc.
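One way to picture a controlled vocabulary plus annotations is as two small lookup tables. The sketch below is illustrative only: the term names, the gene names and the helper functions `ancestors` and `genes_with` are all hypothetical, and a real ontology is far larger and richer.

```python
# Hypothetical mini-vocabulary: each term lists its more general parent terms.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "DNA binding": ["binding"],
    "catalytic activity": ["molecular function"],
    "kinase activity": ["catalytic activity"],
}

# Hypothetical annotations: gene name -> terms assigned directly to it.
ANNOTATIONS = {
    "geneA": {"DNA binding"},
    "geneB": {"kinase activity"},
    "geneC": {"DNA binding", "kinase activity"},
}


def ancestors(term: str) -> set[str]:
    """All terms that are more general than the given term."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_with(term: str) -> list[str]:
    """Genes annotated with the term itself or with any more specific term."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(term == t or term in ancestors(t) for t in terms)
    )


print(genes_with("binding"))  # ['geneA', 'geneC']
```

The point of the shared vocabulary is that a query for a general term also finds genes annotated with its more specific descendants.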
Genes & Their Functions
Gene (DNA) sequence determines protein (AA) sequence,
which determines protein (3D) structure,
which determines protein’s function.
Protein Folding
Protein folding is the challenge of deducing protein structure
from protein sequence.
New CS faculty joining in February ’14: Ron Dror
Gene Families, Gene Names
Genes (proteins) come in families.
Genes of the same family have
similar sequences.
Which is why they fold into similar
structures and perform similar
functions.
Genes of the same family will
typically have a “family name”
followed by a (sequential) number
or “first name”.
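"Similar sequences" can be made concrete with a simple percent-identity measure. This is a minimal sketch that assumes the two sequences are already aligned and of equal length; the function name `percent_identity` and the two short peptide strings are made up for illustration. Real family assignment relies on proper alignment tools.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Two made-up peptides that differ at 2 of 8 positions.
print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # 75.0
```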
Biological vs. Molecular Function: Pathways
Proteins with very different molecular functions work together to
carry out a single biological function, for example, a pathway.
Some “Special” Functions: Gene Regulation
2,000 different proteins can bind specific DNA sequences.
Proteins
DNA
Protein binding site
Gene
DNA
Proteins that regulate the transcription of genes
are called transcription factors.
The Importance of Gene Regulation
The looks & capabilities of different cells are
determined by the subset of genes they express.
Different cell types express very different gene
repertoires (from the same genome).
To change its behavior a cell can change its
transcriptional program.
Think of it as a giant state machine…
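The state machine picture can be sketched as a tiny Boolean network, where the state is the set of genes currently expressed and each gene's next state depends on the current one. Everything here is hypothetical: the gene names `tfA`, `tfB`, `tfC`, the `signal` input and the rules are invented for illustration and do not describe any real regulatory circuit.

```python
# Hypothetical rules: each gene's next on/off value is a function of the current state.
RULES = {
    "signal": lambda s: s["signal"],             # external input, held constant
    "tfA": lambda s: s["signal"],                # tfA is switched on by the signal
    "tfB": lambda s: s["tfA"] and not s["tfC"],  # tfA activates tfB, tfC represses it
    "tfC": lambda s: s["tfB"],                   # tfB activates tfC
}


def step(state: dict[str, bool]) -> dict[str, bool]:
    """Compute the next state from the current one, updating all genes at once."""
    return {gene: bool(rule(state)) for gene, rule in RULES.items()}


def run(state: dict[str, bool], n_steps: int) -> list[dict[str, bool]]:
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory


def expressed(state: dict[str, bool]) -> list[str]:
    """Names of the genes that are on in the given state."""
    return sorted(gene for gene, on in state.items() if on)


start = {"signal": True, "tfA": False, "tfB": False, "tfC": False}
trajectory = run(start, 4)
for t, state in enumerate(trajectory):
    print(t, expressed(state))
```

Changing the input (here, `signal`) moves the cell to a different transcriptional program, which is the sense in which a cell changes its behavior.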
“Special” Function: Cell Signaling
Cells also talk with each other. They send and receive messages,
and change their behavior according to messages they receive.
Signal Transduction
Now it's an even bigger
state machine of
individual state machines
(=cells) talking with each
other, orchestrating their
individual activities.
Alternative Splicing
Genes in the Human Genome
When you only show one transcript per gene locus:
If you ask the GUI to show you all well established gene variants:
Protein Domains
SKSHSEAGSAFIQTQQLHAAMADTFLEHMCRLDIDSAPITARNTG
IICTIGPASRSVETLKEMIKSGMNVARMNFSHGTHEYHAETIKNV
RTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKK
GATLKITLDNAYMAACDENILWLDYKNICKVVEVGSKVYVDDGLI
SLQVKQKGPDFLVTEVENGGFLGSKKGVNLPGAAVDLPAVSEKDI
QDLKFGVDEDVDMVFASFIRKAADVHEVRKILGEKGKNIKIISKI
ENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMIIGR
CNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIML
SGETAKGDYPLEAVRMQHLIAREAEAAMFHRKLFEELARSSSHST
DLMEAMAMGSVEASYKCLAAALIVLTESGRSAHQVARYRPRAPII
AVTRNHQTARQAHLYRGIFPVVCKDPVQEAWAEDVDLRVNLAMNV
GKAAGFFKKGDVVIVLTGWRPGSGFTNTMRVVPVP
A protein domain is a subsequence of the protein that folds
independently of the other portions of the sequence, and often
confers to the protein one or more specific functions.
Alt. Splicing and Protein Repertoire
Alternative splicing often produces protein variants that have a
different domain composition, and thus perform different functions.
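A small sketch of how exon choice changes domain composition. It makes the simplifying assumption that each domain sits entirely within one exon, which is often not true in real genes; the exon numbers, domain names and isoform names are hypothetical.

```python
# Hypothetical gene: which protein domains each exon encodes.
EXON_DOMAINS = {
    1: ["domain_A"],
    2: ["domain_B"],
    3: [],            # this exon encodes no recognized domain
    4: ["domain_C"],
}

# Hypothetical splice variants, each listed as the exons it includes.
ISOFORMS = {
    "isoform_1": [1, 2, 3, 4],
    "isoform_2": [1, 2, 3],
    "isoform_3": [1, 3, 4],
}


def domain_composition(exons: list[int]) -> list[str]:
    """Domains present in a transcript, in order along the protein."""
    return [domain for exon in exons for domain in EXON_DOMAINS[exon]]


compositions = {name: domain_composition(exons) for name, exons in ISOFORMS.items()}
for name, domains in compositions.items():
    print(name, domains)
```

Here isoform_2 loses domain_C and isoform_3 loses domain_B, so the three variants of the same gene would be expected to differ in function.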
Retroposed Genes and Pseudogenes
Pseudogenes (“dead genes”):
Genomic sequences that
resemble (and originated from) genes
but no longer make proteins.
Retrogenes (“retrotranscribed”):
Protein coding RNA that was
reverse transcribed and inserted
back into the genome.
The RNA can be grabbed at any
stage (partial/full transcript,
before/during/after all introns are
spliced).
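The "grabbed at any stage" point can be illustrated by enumerating which introns have already been removed when the RNA is copied back into DNA. This is a toy sketch with made-up sequences and names; it ignores partial transcripts and other details of real retrotransposition.

```python
from itertools import combinations

# Toy gene as alternating exon/intron segments (made-up sequences).
SEGMENTS = [
    ("exon", "ATGGCC"),
    ("intron", "gtaagtcag"),
    ("exon", "AAATTT"),
    ("intron", "gtgagtag"),
    ("exon", "GGGTAA"),
]


def transcript_at_stage(spliced_out: set[int]) -> str:
    """Transcript once the introns at the given segment indices have been removed."""
    return "".join(
        seq
        for i, (kind, seq) in enumerate(SEGMENTS)
        if not (kind == "intron" and i in spliced_out)
    )


intron_indices = [i for i, (kind, _) in enumerate(SEGMENTS) if kind == "intron"]
stages = [
    set(combo)
    for r in range(len(intron_indices) + 1)
    for combo in combinations(intron_indices, r)
]
# Each stage gives one possible sequence for the reverse-transcribed copy.
retrocopies = {transcript_at_stage(stage) for stage in stages}
print(len(retrocopies))  # 4
```

A copy made after all introns are spliced is intronless, which is one common clue that a genomic sequence is a retrogene rather than the original gene.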
Review
Lecture 3
• Central dogma recap
– Focus on protein coding genes
• Gene structure
– exon, intron, 3’/5’ UTR, CDS recap
– The genetic code
– UCSC genome browser sneak peek
– human genome stats
– Gene finding I: ab initio
• Gene (protein) function
– Cell structure, chemical reactions etc
– Pathways (vs. function)
– information processing roles
  • TFs
  • signaling: ligands, receptors, kinases
• Gene families
– similar sequence -> structure -> function
– protein domains
– splice variants, alt promoters
• Special cases
– Pseudogenes
– Retroposed genes (and the distinction between the two)
• Gene ontologies