Download Biological sequence analysis

Document related concepts

Vectors in gene therapy wikipedia , lookup

Epitranscriptome wikipedia , lookup

Transposable element wikipedia , lookup

Endogenous retrovirus wikipedia , lookup

Gene regulatory network wikipedia , lookup

Biosynthesis wikipedia , lookup

Transcriptional regulation wikipedia , lookup

Multilocus sequence typing wikipedia , lookup

Molecular ecology wikipedia , lookup

Bisulfite sequencing wikipedia , lookup

Non-coding DNA wikipedia , lookup

Zinc finger nuclease wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Genetic code wikipedia , lookup

Gene expression wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Promoter (genetics) wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Gene wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Point mutation wikipedia , lookup

RNA-Seq wikipedia , lookup

Homology modeling wikipedia , lookup

Community fingerprinting wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Transcript
Stat 246, Lecture 3
Biological sequence analysis
The objects of our study
DNA, RNA and proteins: macromolecules which are
unbranched polymers built up from smaller units.
DNA: units are the nucleotide residues A, C, G and T
RNA: units are the nucleotide residues A, C, G and U
Proteins: units are the amino acid residues A, C, D, E,
F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y.
To a considerable extent, the chemical properties of
DNA, RNA and protein molecules are encoded in the
linear sequence of these basic units: their primary
structure.
The use of statistics to study linear
sequences of biomolecular units
Can be descriptive, predictive or everything
else in between…..almost business as usual.
Stochastic mechanisms should never be taken
literally, but nevertheless can be amazingly
useful.
Care is always needed: a model or method can
break down at any time without notice.
Biological confirmation of predictions is almost
always necessary.
The statistics of biological sequences
can be global or local
Base composition of genomes:
E. coli: 25% A, 25% C, 25% G, 25% T
P. falciparum: 82%A+T
Translation initiation:
ATG is the near universal motif indicating the
start of translation in DNA coding sequence.
From certainty to statistical models: a brief case study
1 ZNF: Cys-Cys-His-His zinc finger DNA binding domain
Cys-Cys-His-His zinc finger DNA binding domain
Its characteristic motif has regular expression
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX
.
http://www.isrec.isb-sib.ch/software/PATFND_form.html
c.{2,4}c...[livmfywc]........h.{3,5}h
PatternFind output
[ISREC-Server] Date: Wed Aug 22 13:00:41 MET 2001
...
gp|AF234161|7188808|01AEB01ABAC4F945 nuclear protein
NP94b [Homo sapiens] Occurences: 2
Position : 514 CYICKASCSSQQEFQDHMSEPQH
Position : 606 CTVCNRYFKTPRKFVEHVKSQGH
........
gp|X67787|1326037|02AF953C84E0AB5A zinc finger protein
[Saccharomyces cerevisiae] Occurences: 1
Position : 3 CSFDGCEKVYNRPSLLQQHQNSH
200 matches found: output limit reached
This search could have been conducted using a suffix tree representation.
Regular expressions can be limiting
C AGGT A AGT
A
G
TC ≥11 N TC AGC
5’ splice junction in eukaryotes
3’ splice junction
Most protein binding sites are characterized by some
degree of sequence specificity, but seeking a consensus
sequence is often an inadequate way to recognize sites.
Position-specific distributions came to represent
the variability in motif composition.
Cys-Cys-His-His profile: sequence logo form
A sequence logo is a scaled position-specific a.a.distribution.
Scaling is by a measure of a position’s information content.
Sequence logos (T.D. Schneider)
A visual representation of a position-specific distribution.
Easy for nucleotides, but we need colour to depict up to 20
amino acid proportions.
Idea: overall height at position l proportional to information
content (2-Hl); proportions of each nucleotide ( or amino
acid) are in relation to their observed frequency at that
position, with most frequent on top, next most frequent
below, etc..
How do we search with position-specific distributions?
Position-specific scoring matrices
T
A
T
A
A
T
A
0
1
0
1
1
0
A
-38
C
0
0
0
0
0
0
C
G
0
0
0
0
0
0
T
1
0
1
0
0
1
consensus
T
A
19
T
A
T
12
10 -48
-15 -38
-8 -10
-3 -32
G
-13 -48
-6
T
17 -32
8
PSSM
1
A
-7 -10 -48
-9
-6
19
Use of a PSSM to find sites
C
Move the matrix
along the sequence
and score each
“window”.
T
A
12
10 -48
C -15 -38
-8 -10
-3 -32
G -13 -48
-6
17 -32
8
-9
-6
19
A -38
19
1
12
10 -48
C -15 -38
-8 -10
-3 -32
G -13 -48
-6
17 -32
8
-9
-6
19
A -38
19
1
12
10 -48
C -15 -38
-8 -10
-3 -32
G -13 -48
-6
17 -32
8
A -38
T
Peaks should
occur at the “true”
sites.
Of course in general
any threshold will
have some false
positive and false
negative rate.
T
T
19
T
A
1
A
T
C
sum
-93
-7 -10 -48
+85
-7 -10 -48
-7 -10 -48
-9
-6
19
-95
Calculation of a PSSM from counts
A
9
214
63
142
118
8
A
0.04
0.88
0.26
0.59
0.49
0.03
C
22
7
26
31
52
13
C
0.09
0.03
0.11
0.13
0.22
0.05
G
18
2
29
38
29
5
G
0.07
0.01
0.12
0.16
0.12
0.02
T
193
19
124
31
43
216
T
0.80
0.08
0.51
0.13
0.18
0.89
Relative frequencies: fbl
Counts from 242 known sites
2
A -2.76
1.82
0.06
1.23
0.96 -2.92
C -1.46 -3.11 -1.22 -1.00 -0.22 -2.21
1
G -1.76 -5.00 -1.06 -0.67 -1.06 -3.58
T
1.67 -1.66
1.04 -1.00 -0.49
1.84
0
1
PSSM:
log fbl/pb
2
3
4
5
6
Informativeness: 2+∑bpbllog2pbl
Derivation of PSSM entries
Suppose that we have aligned sequence data on a number of
instances of a given type of site.
Candidate sequence
CTATAATC....
Aligned position
123456
Hypotheses:
S=site (and independence)
R=random (equiprobable, independence)
 pr(CTATAA | S) 
= log2 .09x.03x.26x.13x.51x.01 
pr(CTATAA | R) 
.25x.25x.25x.25x.25x.25 
log2 
= (2+log2.09)+...+(2+log2.01)
=
1
-15- 32 + 1 - 9 + 10 - 48
10
More generally, PSSM score sbl = log fbl/pb
l=position, b=base
pb=background frequency
Representation of motifs: further steps
Missing from the position-specific distribution
representation of motifs are good ways of
dealing with:
Length distributions for insertions/deletions
Cross-position association of amino acids
Hidden Markov models help with the first.
The second remains a hard unsolved problem.
Hidden Markov models
Processes {(St,Ot), t=1,…}, where St is the hidden
state and Ot the observation at time t, such that
pr(St | St-1,Ot-1,St-2 ,Ot-2 …) = pr(St | St-1)
pr(Ot | St-1,Ot-1,St-2 ,Ot-2 …) = pr(Ot | St, St-1)
The basics of HMMs were laid bare in a series of
beautiful papers by L E Baum and colleagues around
1970, and their formulation has been used almost
unchanged to this day.
Hidden Markov models:extensions
Many variants are now used. For example, the distribution of O
may not depend on previous S but on previous O values,
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St ), or
pr(Ot | St , St-1 , Ot-1 ,.. ) = pr(Ot | St , St-1 ,Ot-1) .
Most importantly for us, the times of S and O may be
decoupled, permitting the Observation corresponding to State
time t to be a string whose length and composition depends on
St (and possibly St-1 and part or all of the previous
Observations). This is called a hidden semi-Markov or
generalized hidden Markov model.
A simple HMM
bA (i)  1 / 6
PAB
PBB
A
PAA
Initial distribution:
  ( A,  B)
PBA
bB (i)  1/ 4
A lattice view
Observed sequence:
1
4
3
6
6
4
A
A
A
B
A
B
Hidden sequence:
B
B
A
Observed:
1,4,3,6,6,4...
Questions:
1. What is the most likely die sequence?
2. What is the probability of the observed
sequence?
3. What is the probability that the 3rd state is B,
given the observed sequence?
The HMM algorithms
Forward:
 t (i) = P(observed sequence, ending in state i at base t)
Backward:
ß t (i) = P(obs. after t | ending in state i at base t)
Viterbi:
 t (i) = max P(obs. , ending in state i at base t)
Questions:
1. What is the most likely die sequence? Viterbi
2. What is the probability of the observed sequence? Forward
3. What is the probability that the 3rd state is B, given the
observed sequence? Backward
The forward algorithm
 t (i)
= P(observed sequence, ending in state i at base t)
N
t (j)  (  t  1(i)pij )b j (Yt )
i1
N
N
sequence)
(j) (i)
= P(observed
(p )b(Y
(i)
p
)b
(Y
)
t (j)  (  Tt (i)
)
ij j t
1
ij j t t 1
i1
i1
The forward algorithm
State
N
...
...
...
1
2
3
3
2
1
...
...
...
...
Observation
T
Some early applications of HMMs
finance, but we never saw them
speech recognition
modelling ion channels
In the mid-late 1980s HMMs entered genetics and
molecular biology, and they are now firmly entrenched.
Some current applications of HMMs to biology
mapping chromosomes
aligning biological sequences
predicting sequence structure
inferring evolutionary relationships
finding genes in DNA sequence
Profile HMM: m=match state, I-insert state, d=delete state; go from
left to right. I and m states output amino acids; d states are ‘silent”.
d1
d2
d3
d4
I0
I1
I2
I3
I4
m0
m1
m2
m3
m4
Start
m5
End
Pfam domain-HMMs
Pfam is a library of models of recurrent protein “domains”.
They are constructed semi-automatically using hidden
Markov models (HMMs).
Pfam families have permanent accession numbers and
contain functional annotation and cross-references to other
databases, while Pfam-B families are re-generated at each
release and are unannotated.
See http://www.sanger.ac.uk/Software/Pfam/
http://www.cgr.ki.se/Pfam/
http://pfam.wustl.edu/
http://pfam.jouy.inra.fr/
Finding genes in DNA sequence
This is one of the most challenging and interesting problems in
computational biology at the moment. With so many genomes
being sequenced so rapidly, it remains important to begin by
identifying genes computationally.
What is a (protein-coding)
gene?
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
What is a gene, ctd?
In general the transcribed sequence is longer than the
translated portion: parts called introns (intervening
sequence) are removed, leaving exons (expressed
sequence), and yet other regions remain untranslated.
The translated sequence comes in triples called
codons, beginning and ending with a unique start
(ATG) and one of three stop (TAA, TAG, TGA) codons.
There are also characteristic intron-exon boundaries
called splice donor and acceptor sites, and a variety of
other motifs: promoters, transcription start sites, polyA
sites,branching sites, and so on.
All of the foregoing have statistical characterizations.
In more detail
(color ~state)
Some facts about human genes
Comprise about 3% of the genome
Average gene length: ~ 8,000 bp
Average of 5-6 exons/gene
Average exon length: ~200 bp
Average intron length: ~2,000 bp
~8% genes have a single exon
Some exons can be as small as 1 or 3 bp.
HUMFMR1S is not atypical: 17 exons 40-60 bp
long, comprising 3% of a 67,000 bp gene
The idea behind a GHMM genefinder
States represent standard gene features:
intergenic region, exon, intron, perhaps more
(promotor, 5’UTR, 3’UTR, Poly-A,..).
Observations embody state-dependent base
composition, dependence, and signal features.
In a GHMM, duration must be included as well.
Finally, reading frames and both strands must
be dealt with.
Half a model for a genefinder
Splice sites can be included in the exons
Beyond position-specific distributions
The bases in splice sites exhibit dependence, and
not simply of the nearest neighbor kind.
High-order (non-stationary) Markov models would
be one option, but the number of parameters in
relation to the amount of data rules them out. The
class of variable length Markov models (VLMMs)
deriving from early research by Rissanen prove to
be valuable in this context. However, there is
likely to be room for more research here.
GENSCAN (Burge & Karlin)
E0
I0
E1
I1
Ei
Et
3'UTR
poly-A
promoter
Reverse (-) strand
I2
Es
5'UTR
Forward (+) strand
E2
intergenic
region
Forward (+) strand
Reverse (-) strand
62001
AGGACAGGTA CGGCTGTCAT CACTTAGACC TCACCCTGTG GAGCCACACC
62051
CTAGGGTTGG CCAATCTACT CCCAGGAGCA GGGAGGGCAG GAGCCAGGGC
62101
TGGGCATAAA AGTCAGGGCA GAGCCATCTA TTGCTTACAT TTGCTTCTGA
62151
CACAACTGTG TTCACTAGCA ACCTCAAACA GACACCATGG TGCACCTGAC
62201
TCCTGAGGAG AAGTCTGCCG TTACTGCCCT GTGGGGCAAG GTGAACGTGG
62251
ATGAAGTTGG TGGTGAGGCC CTGGGCAGGT TGGTATCAAG GTTACAAGAC
62301
AGGTTTAAGG AGACCAATAG AAACTGGGCA TGTGGAGACA GAGAAGACTC
62351
TTGGGTTTCT GATAGGCACT GACTCTCTCT GCCTATTGGT CTATTTTCCC
62401
ACCCTTAGGC TGCTGGTGGT CTACCCTTGG ACCCAGAGGT TCTTTGAGTC
62451
CTTTGGGGAT CTGTCCACTC CTGATGCTGT TATGGGCAAC CCTAAGGTGA
62501
AGGCTCATGG CAAGAAAGTG CTCGGTGCCT TTAGTGATGG CCTGGCTCAC
62551
CTGGACAACC TCAAGGGCAC CTTTGCCACA CTGAGTGAGC TGCACTGTGA
62601
CAAGCTGCAC GTGGATCCTG AGAACTTCAG GGTGAGTCTA TGGGACCCTT
62651
GATGTTTTCT TTCCCCTTCT TTTCTATGGT TAAGTTCATG TCATAGGAAG
62701
GGGAGAAGTA ACAGGGTACA GTTTAGAATG GGAAACAGAC GAATGATTGC
62751
ATCAGTGTGG AAGTCTCAGG ATCGTTTTAG TTTCTTTTAT TTGCTGTTCA
62801
TAACAATTGT TTTCTTTTGT TTAATTCTTG CTTTCTTTTT TTTTCTTCTC
62851
CGCAATTTTT ACTATTATAC TTAATGCCTT AACATTGTGT ATAACAAAAG
62901
GAAATATCTC TGAGATACAT TAAGTAACTT AAAAAAAAAC TTTACACAGT
62951
CTGCCTAGTA CATTACTATT TGGAATATAT GTGTGCTTAT TTGCATATTC
63001
ATAATCTCCC TACTTTATTT TCTTTTATTT TTAATTGATA CATAATCATT
63051
ATACATATTT ATGGGTTAAA GTGTAATGTT TTAATATGTG TACACATATT
63101
GACCAAATCA GGGTAATTTT GCATTTGTAA TTTTAAAAAA TGCTTTCTTC
63151
TTTTAATATA CTTTTTTGTT TATCTTATTT CTAATACTTT CCCTAATCTC
63201
TTTCTTTCAG GGCAATAATG ATACAATGTA TCATGCCTCT TTGCACCATT
63251
CTAAAGAATA ACAGTGATAA TTTCTGGGTT AAGGCAATAG CAATATTTCT
63301
GCATATAAAT ATTTCTGCAT ATAAATTGTA ACTGATGTAA GAGGTTTCAT
63351
ATTGCTAATA GCAGCTACAA TCCAGCTACC ATTCTGCTTT TATTTTATGG
63401
TTGGGATAAG GCTGGATTAT TCTGAGTCCA AGCTAGGCCC TTTTGCTAAT
63451
CATGTTCATA CCTCTTATCT TCCTCCCACA GCTCCTGGGC AACGTGCTGG
63501
TCTGTGTGCT GGCCCATCAC TTTGGCAAAG AATTCACCCC ACCAGTGCAG
63551
GCTGCCTATC AGAAAGTGGT GGCTGGTGTG GCTAATGCCC TGGCCCACAA
63601
GTATCACTAA GCTCGCTTTC TTGCTGTCCA ATTTCTATTA AAGGTTCCTT
63651
TGTTCCCTAA GTCCAACTAC TAAACTGGGG GATATTATGA AGGGCCTTGA
63701
GCATCTGGAT TCTGCCTAAT AAAAAACATT TATTTTCATT GCAATGATGT
Remark
In general the problem of identifying (annotating)
human genes is considerably harder than ß-globin
might suggest.
The human factor VIII gene (whose mutations
cause hemophilia A) is spread over ~186,000 bp.
It consists of 26 exons ranging in size from 69 to
3,106 bp, and its 25 introns range in size from 207
to 32,400 bp. The complete gene is thus ~9 kb of
exon and ~177 kb of intron.
The biggest human gene yet is for dystrophin. It has
> 30 exons and is spread over 2.4 million bp.
Challenges in the analysis of sequence data
Understanding the biology well enough to begin.
Designing HMM architecture, e.g. in Marcoil for coiled-coils.
Modelling the parts, e.g. VLMMs for splice sites.
Coding = software engineering, is the hardest and most
important task of all: making it all work.
Obtaining good data sets for use in careful evaluation and
comparison with competing algorithms; designing the studies.
Opportunities for methodological research.
Topics not mentioned include
Molecular evolution, including phylogenetic inference
(building trees from aligned sequence data)
Sequence alignment (pairwise, multiple), including
use of Gibbs sampler
Stochastic context-free grammar models and the
analysis of RNA sequence data.
Acknowledgements
Mauro Delorenzi (WEHI)
Simon Cawley (Affymetrix)
Tony Wirth (CS, Princeton)
Lior Pachter (Math, UCB)
Marina Alexandersson (Stat, UCB)
References
Biological Sequence Analysis
R Durbin, S Eddy, A Krogh and G Mitchison
Cambridge University Press, 1998.
Bioinformatics The machine learning approach
P Baldi and S Brunak
The MIT Press, 1998
Post-Genome Informatics
M Kanehisa
Oxford University Press, 2000
HMMs representing coiled-coil domains
2TMA: Tropomyosin
Coiled-coil domains, schematically
dimeric parallel helices, heptad repeats, knobs-into-holes
(hydrophobic)
dislikes water
Designing
the HMM, I
Designing
the HMM, 2
Designing
the HMM, 3
HMM: decoding
Sequence
Labels
Path1
Path2
WGP ARQLNES VKDINKM LER HP
BBB CCCCCCC CCCCCCC CCC BB
000 abcdefg abcdefg abc 00
00c defgabc defgabc def g0
VITERBI decoding: of all possible state-paths, we determine the
maximum probability one given the amino acid sequence O;
POSTERIOR decoding: at each position, we determine the state
with the highest probability given O.
Issue: how to measure the strength of a potential CC domain, and
how should this depend on the length of a domain?
CC-PROBABILITY PROFILE
Fusion protein of simian parainfluenza virus 5
Assessing performance: terms
TP true positive: a predicted fragment that
overlaps the annotated fragment (aa in the
annotated region)
FP false positive: a predicted fragment does not
overlap any annotated fragment (aa not in the
annotated region)
LS learning set of sequences
NTS negative test set; sequences with no CCD
used for estimating FP
PTS positive test set; used for estimating TP
Much care/effort required to create these sets
Assessing performance: study design
Study the variability of performance under variation
of the sequences used for determining the model
parameters.
Compare methods using the same set of
aa-frequencies / emission probabilities.
Use the same set of domains for learning and testing
instead of testing on different protein families.
Choose a number of FP-rates and calculate the
corresponding TP-rates (ROC curve).
PTS subdivided 150 times at random (stratified)
into 2/3 for learning and the rest for testing.
LEARNING PHASE => PARAMETERS
TESTING ON NTS => THRESHOLDS
TESTING ON PTS => TP-VALUES
ANALYSIS OF RESULTS
150 X
Assessing performance: summaries
TP-rate at given FP-rate: per family / per length-class
TP- and FP-rates for aa’s: per family / per length-class
Accuracy: at the borders and of length prediction
The algorithms
As the name suggests, with an HMM the series
O = (O1,O2 ,O3 ,……., OT) is observed, while
the states S = (S1 ,S2 ,S3 ,……., ST) are not.
There are elegant algorithms for calculating
pr(O|), arg max pr(O|) in certain special
cases, and arg maxS pr(S|O,).
Here  are the parameters of the model, e.g.
transition and observation probabilities.