Statistical modeling and classification in Biological Sequence Space
April 26, 04; 9.520
Gene Yeo
Poggio, Burge @MIT
Framework/Issues
• “Build” models around known biology
– In the process, extend knowledge about known
biology
• “Predict” new examples
• “Validate” predictions by
– prediction accuracy
– experimental validation
– higher-level traits of predictions
– conservation in other genomes
Biological sequences
• DNA, RNA and proteins: macromolecules built up from
smaller units.
• DNA: units are the nucleotide residues A, C, G and T
• RNA: units are the nucleotide residues A, C, G and U
• Proteins: units are the amino acid residues A, C, D, E,
F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
• To a considerable extent, the chemical properties of
DNA, RNA and protein molecules are encoded in the
linear sequence of these basic units: their primary
structure.
• Statistical models can be descriptive and/or predictive.
• Given a known biological signal: describe the signal with
statistical modeling and find unknown examples of the same
signal
– Gene-finding (protein-coding genes)
– Noncoding RNA genes
– Protein domains
• Warning: although successful, models are not to be taken
literally.
• Most important: biological confirmation of predictions is
almost always necessary.
Different models
[Figure: models ordered by complexity across DNA, RNA and protein]
• Splice site motif (WMM, MM, SVM, NN)
• Protein gene (HMM, NN)
• RNA gene (Covariation, SCFG, NN, SVM)
• Protein structure (a variety of methods)
A case study in computational
biology: modeling signals in genes
With so many genomes being sequenced, it remains
important to be able to identify genes and the signals within
and around genes computationally.
What is a (protein-coding) gene?
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
mRNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
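The DNA to mRNA to protein example above can be reproduced with a few lines of Python. This is a minimal sketch: it assumes the DNA shown is the coding strand (so transcription is just T to U), that translation starts at the first base, and it uses the standard genetic code. The function names `transcribe` and `translate` are illustrative.

```python
# Standard genetic code, codons ordered with bases U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame 0, stopping at the first stop codon ('*')."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

Real genes add complications this sketch ignores: introns must be spliced out first, and the reading frame starts at a start codon rather than at position 0.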
Some facts about human genes
Comprise about 3% of the genome
Average gene length: ~ 8,000 bp
Average of 5-6 exons/gene
Average exon length: ~200 bp
Average intron length: ~2,000 bp
~8% genes have a single exon
The idea behind a HMM genefinder
• States represent standard gene features:
intergenic region, exon, intron, perhaps more
(promoter, 5’UTR, 3’UTR, poly-A, ...).
• Observations embody state-dependent
statistics, such as base composition, dependence,
and signal features.
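The idea of states with state-dependent base composition can be sketched as a toy three-state HMM decoded with the Viterbi algorithm. Everything numeric here is an assumption made up for illustration (start, transition and emission probabilities), and the model is far simpler than a real gene finder: there are no exon phases, no splice-site signals and no length distributions.

```python
import math

STATES = ["intergenic", "exon", "intron"]

# Hypothetical parameters, chosen only for illustration.
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.85, "intron": 0.10},
    "intron":     {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},  # GC-rich
    "intron":     {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},  # AT-rich
}


def safe_log(p):
    """log(p), with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    scores = [{s: safe_log(START[s]) + safe_log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best = max(STATES, key=lambda r: prev[r] + safe_log(TRANS[r][s]))
            column[s] = prev[best] + safe_log(TRANS[best][s]) + safe_log(EMIT[s][base])
            pointers[s] = best
        scores.append(column)
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATGCGCGCGCGCGCATATATATATAT"))
```

Working in log space avoids numerical underflow on long sequences, and the zero-probability transitions (for example intergenic directly to intron) encode which feature orders are allowed.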
GENSCAN (Burge & Karlin)
[Figure: GENSCAN state diagram. A central intergenic region state connects a forward (+) strand model and a mirrored reverse (-) strand model. States on each strand: promoter, 5'UTR, Es, Ei, E0, E1, E2, I0, I1, I2, Et, 3'UTR, poly-A.]
62001 AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC
62051 CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC
62101 TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA
62151 CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC
62201 TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG
62251 ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC
62301 AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC
62351 TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC
62401 ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC
62451 CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA
62501 AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC
62551 CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA
62601 CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT
62651 GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG
62701 GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC
62751 ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA
62801 TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC
62851 CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG
62901 GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT
62951 CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC
63001 ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT
63051 ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT
63101 GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC
63151 TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC
63201 TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT
63251 CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT
63301 GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT
63351 ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG
63401 TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT
63451 CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG
63501 TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG
63551 GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA
63601 GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT
63651 TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA
63701 GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT
Splice sites can be an important signal
Regular expressions can be limiting
5’ splice junction in eukaryotes: (C/A)AG GT(A/G)AGT
3’ splice junction: TC ≥11 N TC AGC
Most protein binding sites are characterized by some
degree of sequence specificity, but seeking a consensus
sequence is often an inadequate way to recognize sites.
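A short sketch of why a consensus pattern is limiting: a regular expression either matches or it does not, so a site that differs from the consensus at one position is rejected outright. The pattern below assumes the 5’ splice consensus reads (C/A)AG GT(A/G)AGT, and the candidate 9-mers are illustrative strings, not annotated splice sites.

```python
import re

# Assumed reading of the 5' splice consensus: (C/A)AG | GT(A/G)AGT
DONOR_CONSENSUS = re.compile(r"[CA]AGGT[AG]AGT")

# Illustrative 9-mers: two fit the consensus exactly, two are one base off.
candidates = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTTGGT", "AGGGTGAGT"]

# fullmatch requires the whole 9-mer to fit the pattern.
matches = {site: bool(DONOR_CONSENSUS.fullmatch(site)) for site in candidates}

for site, hit in matches.items():
    print(site, "matches" if hit else "rejected")
```

The last two candidates still contain the GT dinucleotide and resemble the consensus closely, yet the pattern gives no way to say "close" or to rank them, which is the motivation for position-specific probabilities.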
Position-specific distributions came to represent
the variability in motif composition.
Position-specific scoring matrix (PSSM)
Pos   -3   -2   -1   +1   +2   +3   +4   +5   +6
A    0.3  0.6  0.1  0.0  0.0  0.4  0.7  0.1  0.1
C    0.4  0.1  0.0  0.0  0.0  0.1  0.1  0.1  0.2
G    0.2  0.2  0.8  1.0  0.0  0.4  0.1  0.8  0.2
T    0.1  0.1  0.1  0.0  1.0  0.1  0.1  0.0  0.5
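S = S1 S2 S3 S4 S5 S6 S7 S8 S9
Odds ratio:
\[ R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{5}(S_8)\,P_{6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)} \]
Score:
\[ s = \log_2 R \]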
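The log-odds score can be sketched directly from the matrix above. The matrix values are read from the slide's table (each column sums to 1); the uniform background of 0.25 per base is an assumption, since the slide does not give Pbg, and the name `log_odds_score` is illustrative.

```python
import math

POSITIONS = [-3, -2, -1, 1, 2, 3, 4, 5, 6]

# Position-specific probabilities, one list per base, ordered as POSITIONS.
PSSM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}

BACKGROUND = 0.25  # assumed uniform background probability per base


def log_odds_score(site):
    """s = log2( P(S|+) / P(S|-) ) for a 9-mer, summed position by position."""
    if len(site) != len(POSITIONS):
        raise ValueError("site must have exactly 9 bases")
    total = 0.0
    for index, base in enumerate(site):
        p = PSSM[base][index]
        if p == 0.0:
            # A base never seen at this position gives odds ratio 0.
            return float("-inf")
        total += math.log2(p / BACKGROUND)
    return total


print(log_odds_score("CAGGTAAGT"))  # about 12.46 bits
print(log_odds_score("CAGGTTGGT"))  # lower, but still finite
print(log_odds_score("CAGATAAGT"))  # -inf: A has probability 0.0 at +1
```

Because the score is a sum of per-position terms, it treats positions as independent. In practice pseudocounts are often added so that a single unseen base does not force the score to minus infinity.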
OK, so we have the genes
• Molecular biology (transcription, splicing)
• Signals are modeled as states (HMM) or
separately, e.g. with PSSMs
• Here’s another catch: there isn’t always just one
version of each gene.
• Sometimes there are several
Eg. alternative splicing - CD44
Human chromosome 11p…
Zhu et al Science… (2003)
Alternative splicing
• is a major determinant of protein
diversity (Lander 2001, Zavolan 2003)
• 30-50% of human diseases involve alternative
splicing
Defining constitutive and alternative exons
• Constitutive exon
• Skipped exon
• 3’ alternative exon
• 5’ alternative exon
• Intron retention
• Mutually exclusive exons
Fragile X Related Gene, FXR1
Conserved alternative, skipped exon - FXR1
Myotonic Dystrophy-containing WD Repeat, DMWD
Another example of genes containing CSE: DMWD
Predicting new alternatively spliced exons
1. The problem is ‘ill-posed’
2. High-dimensional space
3. Avoid overfitting the data
4. Simple feature selection
5. Unbalanced data set sizes
6. Labels are more “flexible”
Eg. of experimentally validated
Biological sequence space: challenges
• Models that “represent” as much of the biology as
possible.
• Biologically motivated features are important
• Validating attributes:
– Conservation of events is key in computational biology
– Higher-level consistency with known biology
• Experimental validation of predictions is
essential
Framework/Issues
• “Build” models around known biology
– In the process, extend knowledge about known
biology
• “Predict” new examples
• “Validate” predictions by
– prediction accuracy
– experimental validation
– higher-level traits of predictions
– conservation in other genomes
If time permits
Modeling higher order
interactions: Yeast Phe tRNA
Secondary Structure
Tertiary Structure
The Hammerhead Ribozyme
Secondary structure
Tertiary structure
One example of how to model and predict RNA 2° structure
• Covariation (using comparative genomics)
Seq1: A C G A A A G U
Seq2: U A G U A A U A
Seq3: A G G U G A C U
Seq4: C G G C A A U G
Seq5: G U G G G A A C
Method of Covariation / Compensatory changes
Mutual information statistic
for a pair of columns i, j in a multiple alignment:
\[ M_{ij} = \sum_{x,y} f^{(i,j)}_{x,y} \, \log_2 \frac{f^{(i,j)}_{x,y}}{f^{(i)}_{x} \, f^{(j)}_{y}} \]
\( f^{(i,j)}_{x,y} \) = fraction of seqs w/ nt. x in col. i, nt. y in col. j
\( f^{(i)}_{x} \) = fraction of seqs w/ nt. x in col. i
sum over x, y = A, C, G, U
\( M_{ij} \) is maximal (2 bits) if x and y individually appear
at random (A,C,G,U equally likely), but are perfectly
correlated (e.g., always complementary)
Inferring 2° Structure from Covariation
Stochastic Context-Free
Grammars (SCFGs)
• A generalized model which is capable of
handling non-local dependencies between
words in a language (or bases in an RNA)
Ref: Durbin et al. “Biological Sequence Analysis” 1998
An SCFG Model of RNA 2° Structure
“Production Rules”:
• P → aWb (“pair”)
• L → aW (“left bulge/loop”)
• R → Wa (“right bulge/loop”)
• B → SS (“bifurcation”)
• S → W (“start”)
• E → ε (“end”)
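To see how these production rules generate an RNA with its secondary structure, the sketch below expands a hand-written parse tree into a sequence and a dot-bracket string. It is deliberately not stochastic: rule probabilities and parsing (finding the best tree for a given sequence) are left out. The tuple encoding of the tree and the name `emit` are assumptions made for this illustration.

```python
def emit(node):
    """Expand a parse tree into (sequence, dot-bracket structure).

    Node encodings (hypothetical, for illustration):
      ("S", child)         S -> W
      ("P", a, b, child)   P -> a W b   (a pairs with b)
      ("L", a, child)      L -> a W     (unpaired base on the left)
      ("R", a, child)      R -> W a     (unpaired base on the right)
      ("B", left, right)   B -> S S     (bifurcation)
      ("E",)               E -> empty string
    """
    kind = node[0]
    if kind == "E":
        return "", ""
    if kind == "S":
        return emit(node[1])
    if kind == "P":
        _, a, b, child = node
        seq, struct = emit(child)
        return a + seq + b, "(" + struct + ")"
    if kind == "L":
        _, a, child = node
        seq, struct = emit(child)
        return a + seq, "." + struct
    if kind == "R":
        _, a, child = node
        seq, struct = emit(child)
        return seq + a, struct + "."
    if kind == "B":
        _, left, right = node
        left_seq, left_struct = emit(left)
        right_seq, right_struct = emit(right)
        return left_seq + right_seq, left_struct + right_struct
    raise ValueError("unknown rule: " + str(kind))


# A hairpin: two base pairs (G-C, C-G) closing a three-base loop.
hairpin = ("S", ("P", "G", "C",
                 ("P", "C", "G",
                  ("L", "A", ("L", "A", ("L", "A", ("E",)))))))

print(emit(hairpin))  # ('GCAAAGC', '((...))')
```

The pair rule emits two bases at once, one on each side of whatever W produces, which is how the grammar captures the non-local dependency between paired positions that an HMM cannot.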
Last page
• Some of the slides were obtained from various
places:
– slides available online (primarily from lectures
by Terry Speed).
– slides from Chris Burge and Dirk Holste