CS173
Lecture 3: Protein coding genes
MW 11:00-12:15 in Beckman B302
Prof: Gill Bejerano
TAs: Jim Notwell & Harendra Guturu
http://cs173.stanford.edu [BejeranoWinter12/13]
Announcements
• http://cs173.stanford.edu/ is up
– Course guidelines, lecture slides, etc.
• Communications via Piazza
– Private Q: post to “instructors” not “class”
– Auditors sign up too
– Office hours TBA before HW1
• Project groups: TBD after “shopping season”
• Tutorials: first three Wednesdays
– Recommended to bring your laptop to UCSC tutorial 1/16
• We will be recruiting for our lab from class
– Many other labs on campus would love to have you too!
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA
TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC
TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC
TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT
CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG
AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT
TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG
TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT
TGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT
TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC
ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA
GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA
ATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA
TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA
ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT
ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTT
TGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGT
TCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATAC
ATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT
GCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA
CGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGA
ATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACA
TCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAAC
GGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAA
CTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTG
GCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTC
TTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAAT
TGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT
GCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT
AATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCT
TCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT
AATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGA
TTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTA
CTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTT
TACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTT
ACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAA
AATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGT
Central Dogma of Biology
Genomes, Genes & Proteins
The most visible instructions in our genome are Genes.
Genes explain exactly HOW to synthesize any protein.
Proteins are the workhorses of every living cell.
Genome: ...ACGTACGACTGACTAGCATCGACTACGACTAGCAC...
[Figure labels: gene, protein, cell]
Gene Structure
Gene Processing
Translation: The Genetic Code
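The genetic code maps each three-base DNA codon to an amino acid or to a stop signal. The following is a minimal sketch of translation using the standard code; the function name `translate` is an illustrative choice, and the sketch assumes the input is a forward-strand coding sequence (CDS) made only of A, C, G and T, starting in frame.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding DNA sequence, stopping at the first in-frame stop codon."""
    cds = cds.upper()
    protein = []
    # Walk the sequence one codon (three bases) at a time; a trailing partial codon is ignored.
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Here `ATG` encodes methionine (M), `GCC` encodes alanine (A), and `TAA` is a stop codon, so the example yields the two-residue peptide `MA`.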
The gene centric genome
“The Genetic code”
A gene centric term.
For a gene centric world.
But fashions change.
Controlled by mass media,
technology, money, and a
bit of scientific truth.
Visualizing Gene Structure
Genes in the Human Genome
There are ~25,000 protein coding genes in the human genome.
(Even halfway through sequencing the human genome,
researchers thought there would be well over 100,000 genes.)
Everything in Genomics is a Moving Target
• The genomes (i.e., assemblies)
• Their annotations
• Our understanding of Biology
• The portals
Why ~25,000?
Conclusion: write code that can be run... and rerun and rerun and rerun and rerun
Gene Finding I: ab initio
Challenge:
“Find the genes, the whole genes, and nothing but the genes”
Understand Biology -> Write discovery tools
(Our) answer depends on our understanding, data & tools
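As a rough illustration of what an ab initio discovery tool starts from, the sketch below scans the forward strand for open reading frames (an ATG followed by an in-frame stop codon). This is a deliberately naive assumption-laden toy, not how production gene finders work: it ignores introns, the reverse strand, and any statistical model of coding sequence. The function name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 30):
    """Return (start, end) pairs of forward-strand ORFs, with end exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon, and is kept only if it has at least `min_codons` codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
```

Changing `min_codons`, the input assembly, or the definition of an ORF changes the answer, which is one small way to see why the result depends on understanding, data and tools.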
Gene (Protein really) Functions
The most visible instructions in our genome are Genes.
Genes explain exactly HOW to synthesize any protein.
Proteins are the workhorses of every living cell.
Genome: ...ACGTACGACTGACTAGCATCGACTACGACTAGCAC...
[Figure labels: gene, protein, cell]
Just look at the cell. Lots and lots of different functions to perform.
(“Only 20,000 genes”...)
First full draft of the Human Genome
Human Genome Consortium (HGC), 2001
Celera, 2001
Biological Functions of the Human Gene Set
Focus on
the X axis:
[HGC, 2001]
Molecular Functions of the Human Gene Set
[Celera, 2001]
Biological vs. Molecular Function: Pathways
Proteins with very different molecular functions participate to
manifest a single biological function, for example: a pathway.
“Special” Function: Gene Regulation
2,000 different proteins can bind specific DNA sequences.
[Figure labels: proteins, DNA, protein binding site, gene]
Proteins that regulate the transcription of other genes
are called transcription factors.
The Importance of Gene Regulation
The looks & capabilities of different cells are
determined by the subset of genes they express.
Different cell types express very different gene
repertoires (from the same genome).
To change its behavior a cell can change its
transcriptional program.
Think of it as a giant state machine…
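One way to make the state-machine picture concrete is a toy Boolean network, where the cell's state is the on/off status of each gene and every update applies the regulatory rules to the current state. The sketch below is purely illustrative: the gene names (`tf_a`, `tf_b`, `target`), the inputs (`signal`, `repressor`) and the rules are hypothetical, and real regulation is neither strictly Boolean nor synchronous.

```python
# Hypothetical regulatory rules: each gene's next on/off value is computed
# from the current state of the whole cell.
RULES = {
    "tf_a": lambda s: s["signal"],                         # an external signal switches tf_a on
    "tf_b": lambda s: s["tf_a"],                           # tf_a activates tf_b
    "target": lambda s: s["tf_b"] and not s["repressor"],  # tf_b activates target unless repressed
}


def step(state):
    """Apply every rule to the current state at once (synchronous update)."""
    new_state = dict(state)  # genes without a rule (the inputs) keep their value
    for gene, rule in RULES.items():
        new_state[gene] = rule(state)
    return new_state


def run(state, n_steps):
    """Return the list of states visited, starting with the initial state."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory


initial = {"signal": True, "repressor": False,
           "tf_a": False, "tf_b": False, "target": False}
for t, s in enumerate(run(initial, 3)):
    print(t, sorted(gene for gene, on in s.items() if on))
```

In this toy, the signal takes three steps to propagate through the cascade before `target` is expressed; setting `repressor` to `True` keeps `target` off, which is a change of transcriptional program from the same set of rules.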
“Special” Function: Cell Signaling
Cells also talk with each other. They send and receive messages,
and change their behavior according to messages they receive.
Signal Transduction
Now it's an even bigger state machine of individual state machines
(= cells) talking with each other, orchestrating their individual activities.
Back to Genes & Their Functions
Gene (DNA) sequence determines protein (AA) sequence,
which determines protein (3D) structure,
which determines protein’s function.
Protein Folding
Protein folding is the challenge of deducing protein structure
from protein sequence. It’s a tough one…
Gene Families, Gene Names
Genes (proteins) come in families.
Genes of the same family have similar sequences, which is why they fold
into similar structures and perform similar functions.
Genes of the same family will typically have a “family name”
followed by a (sequential) number or “first name”.
Alternative Splicing
Genes in the Human Genome
When you only show one transcript per gene locus:
If you ask the GUI to show you all well established gene variants:
Protein Domains
SKSHSEAGSAFIQTQQLHAAMADTFLEHMCRLDIDSAPITARNTG
IICTIGPASRSVETLKEMIKSGMNVARMNFSHGTHEYHAETIKNV
RTATESFASDPILYRPVAVALDTKGPEIRTGLIKGSGTAEVELKK
GATLKITLDNAYMAACDENILWLDYKNICKVVEVGSKVYVDDGLI
SLQVKQKGPDFLVTEVENGGFLGSKKGVNLPGAAVDLPAVSEKDI
QDLKFGVDEDVDMVFASFIRKAADVHEVRKILGEKGKNIKIISKI
ENHEGVRRFDEILEASDGIMVARGDLGIEIPAEKVFLAQKMIIGR
CNRAGKPVICATQMLESMIKKPRPTRAEGSDVANAVLDGADCIML
SGETAKGDYPLEAVRMQHLIAREAEAAMFHRKLFEELARSSSHST
DLMEAMAMGSVEASYKCLAAALIVLTESGRSAHQVARYRPRAPII
AVTRNHQTARQAHLYRGIFPVVCKDPVQEAWAEDVDLRVNLAMNV
GKAAGFFKKGDVVIVLTGWRPGSGFTNTMRVVPVP
A protein domain is a subsequence of the protein that folds
independently of the other portions of the sequence, and often
confers to the protein one or more specific functions.
Alt. Splicing and Protein Repertoire
Alternative splicing often produces protein variants that have a
different domain composition, and thus perform different functions.
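The link between splicing and domain composition can be sketched with a toy model in which each exon contributes at most one domain and an isoform is defined by the exons it skips. Everything here is a hypothetical simplification: the exon names, the domain labels, and the one-domain-per-exon assumption (real domains often span several exons).

```python
from typing import NamedTuple, Optional


class Exon(NamedTuple):
    name: str
    domain: Optional[str]  # domain this exon contributes in the toy model, if any


# Hypothetical four-exon gene; exon e2 carries no domain.
GENE = [
    Exon("e1", "domain_A"),
    Exon("e2", None),
    Exon("e3", "domain_B"),
    Exon("e4", "domain_C"),
]


def domain_composition(exons, skipped=frozenset()):
    """List the domains of the isoform that omits the exons named in `skipped`."""
    return [e.domain for e in exons
            if e.name not in skipped and e.domain is not None]


print(domain_composition(GENE))                  # full-length isoform
print(domain_composition(GENE, skipped={"e3"}))  # isoform skipping exon e3
```

The isoform that skips `e3` lacks `domain_B`, so in this picture it would be expected to lose whatever function that domain confers.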
Retroposed Genes and Pseudogenes
Pseudogenes (“dead genes”):
Genomic sequences that
resemble (and originated from) genes
but no longer make proteins.
Retrogenes (“retrotranscribed”):
Protein coding RNA that was
reverse transcribed and inserted
back into the genome.
The RNA can be grabbed at any
stage (partial/full transcript,
before/during/after all introns are
spliced).
Gene Ontologies
1. Make a controlled vocabulary of gene functions.
2. Annotate all genes using this vocabulary.
Map: genes -> papers -> biological functions.
(plenty of room for Natural Language Processing)
Used to catalog human gene functions, and also
which genes are expressed where,
what defects have been found when
certain genes are mutated, etc.
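The two steps above (define a controlled vocabulary, then annotate genes with it) can be sketched as a small data structure. This is a toy under stated assumptions: the gene names are hypothetical, the handful of terms is only a stand-in for a real vocabulary, and each term has a single parent here, whereas real ontologies allow several.

```python
# Controlled vocabulary: term -> parent term (None marks the root).
VOCABULARY = {
    "molecular function": None,
    "binding": "molecular function",
    "DNA binding": "binding",
    "catalytic activity": "molecular function",
    "kinase activity": "catalytic activity",
}

# Annotations: hypothetical gene name -> set of vocabulary terms.
ANNOTATIONS = {
    "geneA": {"DNA binding"},
    "geneB": {"kinase activity"},
    "geneC": {"binding"},
}


def ancestors(term):
    """Return the term followed by all of its ancestors up to the root."""
    lineage = []
    while term is not None:
        lineage.append(term)
        term = VOCABULARY[term]
    return lineage


def annotate(gene, term):
    """Attach a term to a gene, rejecting anything outside the vocabulary."""
    if term not in VOCABULARY:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    ANNOTATIONS.setdefault(gene, set()).add(term)


def genes_with(term):
    """Genes annotated with the term or with any more specific descendant."""
    return sorted(gene for gene, terms in ANNOTATIONS.items()
                  if any(term in ancestors(t) for t in terms))


print(genes_with("binding"))
```

Because the vocabulary is hierarchical, asking for `binding` also returns genes annotated with the more specific `DNA binding`, which is the practical payoff of a controlled vocabulary over free-text descriptions.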
Review
Lecture 3
• Central dogma recap
– Focus on protein coding genes
• Gene structure
– exon, intron, 3’/5’ UTR, CDS recap
– The genetic code
– UCSC genome browser sneak peek
– human genome stats
– Gene finding I: ab initio
• Gene (protein) function
– Cell structure, chemical reactions etc
– Pathways (vs. function)
– information processing roles
  • TFs
  • signaling: ligands, receptors, kinases
• Gene families
– similar sequence -> structure -> function
– protein domains
– splice variants, alt promoters
• Special cases
– Pseudogenes
– Retroposed genes (and the distinction between the two)
• Gene ontologies