1. SVM example: Computational Biology
Assume a fixed species (e.g., baker's yeast, S. cerevisiae) with genome G (its collection of genes).
Typically a transcription factor (TF) t binds to the promoter (upstream) DNA near a gene g and initiates transcription. In this case we say g is a target of t.
Question: given a fixed TF t, which genes g ∈ G are its targets?
Chemically this is hard to solve:
[Fig. 1: Left: DNA binding of GCM; right: DNA binding of Fur (C. Ehmke, E. Pohl, EMBL, 2005)]
Try machine learning: consider a training data set (genes known to be targets or non-targets of t)

D_0 = {(g_i, y_i)}_{i=1}^n,

where g_i is a sample gene and

y_i = +1 if g_i is a target, and y_i = −1 otherwise.
For all genes g define the function

f_0(g) = y = +1 if g is a target, −1 otherwise.

How can we learn the general function f_0 : G → {±1} from the examples in D_0?
Start by representing g as a feature vector

x = [x_1, x_2, …, x_d]^T

with 'useful' information about g and its promoter region. What is useful here? Consider

ACGGTCTGGT…CGT = promoter DNA sequence of g = upstream region.

This is where the TF typically binds; it has ~1000 bases.
2. Feature maps
One useful choice of feature vector. Example: consider an ordered list of all possible strings of length 6:
string1 AAAAAA
string2 AAAAAC
string3 AAAAAG
string4 AAAAAT
string5 AAAACA
⋮

of all sequences of 6 base pairs.
Given a gene g, choose the feature vector x = [x_1, …, x_d], where

x_1 = # appearances of string 1 in the promoter of g,
x_2 = # appearances of string 2 in the promoter of g,

etc. Consider the feature map Φ which takes g to x:

Φ(g) = x.

The number of possible strings of length 6 is 4^6 = 4,096.
Thus d = 4,096, and

x = [x_1, …, x_d]^T ∈ R^4096 ≡ F.

Let F = the space of all possible x = the feature space (a vector space). Henceforth we replace g by its feature vector x = x(g); to classify g we classify x. Using x = x(g), the goal is to find
f(x) = y = +1 if g is a target, −1 if g is not a target.

Thus f maps a vector x of string counts in F to a yes or no.
Replace g_i by x_i = x(g_i):

D_0 = {(g_i, y_i)}_i → D = {(x_i, y_i)}_i.

Thus, given examples f(x_i) = y_i for the feature vectors x_i in our sample, we want to generalize and find f(x) for all x.
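As a concrete illustration, here is a minimal Python sketch of the 6-mer count feature map Φ (the promoter string is made up; real promoters would be ~1000 bases):

```python
from itertools import product

# Ordered list of all 4^6 = 4096 possible 6-mers: AAAAAA, AAAAAC, ...
KMERS = ["".join(p) for p in product("ACGT", repeat=6)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_feature_vector(promoter, k=6):
    """Return x = [x_1, ..., x_4096], where x_i is the number of
    (overlapping) appearances of the i-th 6-mer in the promoter."""
    x = [0] * len(KMERS)
    for start in range(len(promoter) - k + 1):
        window = promoter[start:start + k]
        if window in INDEX:        # skip windows with unknown bases
            x[INDEX[window]] += 1
    return x

x = kmer_feature_vector("ACGGTCTGGTACGGTCAACGGT")  # toy promoter
```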
With the data set D, can we find the right function f : F → {±1} which generalizes the above examples, so that f(x) = y for all feature vectors?

Easier: find an f : F → R (real numbers), where

f(x) > 0 if y = 1;  f(x) < 0 if y = −1.

The resulting predictive accuracy depends on the number of features used, i.e., on what the components x_i of x mean. For example, x_249 can be how many times ACGGAT appears, and x_4,598 can be how many times ACG___GAT appears (i.e., any 3 letters allowed to go in the middle). Generally we choose the most significant features, the ones most helpful for discrimination.
[Figure: For TF YIR018W, accuracy in prediction of targets vs. number of features.]
SVM: Nonlinear Feature Maps and Kernels
http://www.youtube.com/watch?v=3liCbRZPrZA
1. General SVM: when K(x, y) is not an ordinary dot product

Recall: in yeast, G = the genome (all genes). Given a fixed transcription factor t, we want to determine which genes g ∈ G bind to t.
We have a feature map x : G → F = R^d, with F the feature space, and

x(g) = x = feature vector of gene g.

For each g define

y(g) = +1 if g binds, −1 otherwise.

We want a map f(x) which classifies genes. That is, for g ∈ G with feature vector x = Φ(g) we want
f(x) > 0 if y(g) = 1, and f(x) ≤ 0 if y(g) = −1.

We have examples {g_i}_{i=1}^n ⊆ G of genes with known binding, together with the labels y(g_i). Define

x_i = x(g_i)

to be the feature vectors of the examples.
The SVM provides f of the form

f(x) = w · x + b;

f(x) > 0 yields the conclusion y = 1 (binding gene), and otherwise y = −1. Thus we have a linear separation of the points in F.
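As a sketch of this step in practice (assuming scikit-learn as the solver, rather than the quadratic programming packages listed later; the data are toy numbers):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: rows are feature vectors x_i, labels y_i = +/-1.
X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
y = np.array([1, 1, -1, -1])

# Linear SVM: learns f(x) = w . x + b.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

x_new = np.array([1.8, 0.2])
print(np.sign(w @ x_new + b))   # +1 -> predicted binding gene
```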
What about nonlinear separations?

1. Replace the base space G by F (i.e., replace each gene by its feature vector).
Thus we have a collection of examples {(x_i, y_i)}_{i=1}^n of feature vectors x_i for which the binding y_i is known. We desire a new (possibly nonlinear) function f(x) which is positive when x is the feature vector of a binding gene and negative otherwise.

2. With F now as the base space, define a new feature map Φ : F → F_1 (now possibly nonlinear, but continuous). Map the collection of examples into F_1. Thus the new set of examples is
{z_i ≡ (Φ(x_i), y_i)}_{i=1}^n.

Train a linear SVM in F_1 (the SVM algorithm above).
3. New decision rule:

f_1(Φ(x)) ≡ w_1 · Φ(x) + b.

If f_1(Φ(x)) > 0 we conclude y = 1 (the gene binds), and otherwise y = −1. The equivalent rule on the original F is

f(x) = f_1(Φ(x)).

This allows arbitrary nonlinear separating surfaces on F.
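A small sketch of this two-step construction (an explicit, hypothetical quadratic feature map Φ followed by a linear SVM in F_1; assuming scikit-learn):

```python
import numpy as np
from sklearn.svm import LinearSVC

def phi(x):
    """Explicit nonlinear feature map F -> F_1: degree-2 monomials."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

# Toy labels separable by a circle (nonlinear in F, linear in F_1).
X = np.array([[0.1, 0.2], [-0.2, 0.1], [1.5, 1.2], [-1.4, 1.3]])
y = np.array([1, 1, -1, -1])

Z = np.array([phi(x) for x in X])     # map the examples into F_1
clf = LinearSVC().fit(Z, y)           # linear SVM in F_1

# Decision rule back on the original F: f(x) = f_1(Phi(x)).
f = lambda x: clf.decision_function([phi(x)])[0]
print(f(np.array([0.0, 0.3])), f(np.array([1.6, -1.1])))
```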
2. The kernel trick
Equivalently to the above: assume w_1 = Φ(w) for some w. Then the new decision rule is

f(x) = w_1 · Φ(x) + b = Φ(w) · Φ(x) + b.

Define the kernel function on F by

K(x, y) = Φ(x) · Φ(y)

(with · the ordinary dot product, as before).
Back in F, one can show that K(x, y) is a Mercer kernel:

(a) K(x, y) = K(y, x);

(b) K(x, y) is positive definite: indeed, given any set {x_i}_{i=1}^n,

K(x_i, x_j) = Φ(x_i) · Φ(x_j) = u_i · u_j

with u_i = Φ(x_i), and we already know the ordinary dot product gives a positive definite kernel;

(c) K is continuous, because Φ is continuous.
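Properties (a) and (b) can be checked numerically for a given Gram matrix; a minimal sketch with toy random feature vectors:

```python
import numpy as np

def looks_like_mercer_gram(K_matrix, tol=1e-10):
    """Check symmetry and positive semidefiniteness of a Gram matrix."""
    symmetric = np.allclose(K_matrix, K_matrix.T)
    eigenvalues = np.linalg.eigvalsh(K_matrix)
    return symmetric and eigenvalues.min() >= -tol

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))     # rows u_i = Phi(x_i)
K_matrix = U @ U.T              # K_ij = u_i . u_j
print(looks_like_mercer_gram(K_matrix))   # True: dot products are pos. def.
```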
Like any kernel function, K(x, y) satisfies the defining properties of an inner product, and so can be thought of as a new dot product on F. Thus

f(x) = K(w, x) + b.

With the redefined dot product w · x ≡ K(w, x), training the SVM is identical to before (we have already developed the algorithm); we just replace the old dot product by the new one.
Conclusion: the introduction of nonlinear separators for the SVM via replacement of x ∈ F by a nonlinear function Φ(x) is exactly equivalent to replacement of the dot product w · x by K(w, x) with K a Mercer kernel!

This is equivalent to replacing the standard linear kernel

K(w, x) = w · x (linear)

by a general nonlinear kernel, e.g., the Gaussian kernel

K(w, x) = e^{−|w−x|²}.
Recall the advantages: the calculation of w and b involves only linear algebra on the matrix K_ij = K(x_i, x_j).
3. Examples

Ex 1: Gaussian kernel

K_σ(x, y) = e^{−|x−y|² / (2σ²)}

[one can show this is a positive definite Mercer kernel].
SVM: from (4) above we have

f(x) = Σ_j a_j K(x, x_j) + b = Σ_j a_j e^{−|x−x_j|² / (2σ²)} + b,

where the examples x_j in F have known classifications y_j, and a_j, b are obtained by quadratic programming.
What kind of classifier is this? It depends on σ (see the Vert movies). Movie 1 varies σ in the Gaussian (σ = ∞ corresponds to a linear SVM); Movie 2 then varies the margin |w_1| (in the Gaussian feature space F_2) as determined by changing λ, or equivalently C = 1/(2λn).
4. Software available
Software which implements the quadratic programming
algorithm above includes:
• SVMLight: http://svmlight.joachims.org
• SVMTorch: http://www.idiap.ch/learning/SVMTorch.html
• LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm
A Matlab package which implements most of these is
Spider:
http://www.kyb.mpg.de/bs/people/spider/whatisit.html
Computational Biology Applications

References:

T. Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 1999.

S. Ramaswamy et al., Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures, PNAS, 2001.

B. Schölkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, 2004.

J.P. Vert, http://cg.ensmp.fr/~vert/talks/060921icgi/icgi.pdf

http://cg.ensmp.fr/%7Evert/svn/bibli/html/biosvm.html
Transcription factor binding
1. Matching genes and TFs

For a yeast gene g ∈ G, the feature vector is x = Φ(g).

Examples: the data set D = {(x_i, y_i)}_{i=1}^n of known examples of feature vectors x_i (together with binding labels y_i = ±1) is used to form the kernel matrix

K_ij = K(x_i, x_j),

where, e.g.,

K(x, y) = x · y is the linear kernel,

K(x, y) = e^{−‖x−y‖²} is the Gaussian kernel,

K(x, y) = (x · y + 1)^k is the polynomial kernel.

One can show that all of these are Mercer kernels.
Form the discriminant function

f(x) = K(w, x) + b,

with w, b determined by the quadratic programming algorithm using the matrix K_ij formed from the data set D.

f(x) > 0 means the corresponding gene g with feature vector Φ(g) = x binds to the transcription factor.

What features are we interested in? Characterize g by its upstream region of about 1000 bases (reading from the 5′ end of the DNA).
Interesting feature maps:

Φ_1(g) = x = vector of 6-string counts.

Specifically: take all possible 6-strings (strings of 6 consecutive bases, e.g., ATGAAC), index them with i = 1 to 4^6 = 4096, and form the vector x with

x_i = # appearances of the i-th 6-mer in the upstream region of the corresponding gene g

(a large space!). Or:
Φ_2(g) = x = vector of microarray experiment results. Specifically,

x_i = 1 if gene g is expressed in the i-th microarray experiment, and 0 otherwise.

Or:
Φ_3(g) = vector of gene ontology appearances of g, i.e., x_i = 1 if the i-th ontology term applies to gene g.

Φ_4(g) = vector of melting temperatures of the DNA along consecutive positions in the upstream region.

Each feature map Φ_k yields a different kernel K_k(x, y) and kernel matrix K_ij^(k).
Combination of features: integrate all the information into one large vector, i.e., concatenate the vectors Φ_k(x) into one:

Φ_comb(x) = (Φ_1(x), …, Φ_m(x)).

This is equivalent to taking a direct sum F of the corresponding feature spaces F_1, …, F_m. How do we define the inner product in the large feature space F? In the obvious way for a concatenation of vectors:
Φ_comb(x) · Φ_comb(y) = Σ_k Φ_k(x) · Φ_k(y).

Thus the kernel corresponding to the feature map Φ_comb is given by

K_comb(x, y) = Φ_comb(x) · Φ_comb(y) = Σ_k Φ_k(x) · Φ_k(y) = Σ_k K_k(x, y).

Thus the SVM kernel K_comb which combines all the feature information is the sum of the individual kernels K_k!
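A minimal sketch of this in practice (toy random Gram matrices standing in for the per-feature-map kernels K^(k); assuming scikit-learn's precomputed-kernel interface):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 6
y = np.array([1, 1, 1, -1, -1, -1])

def toy_gram(d):
    """Stand-in for one feature map's Gram matrix K^(k)."""
    U = rng.normal(size=(n, d))   # rows play the role of Phi_k(x_i)
    return U @ U.T

K_list = [toy_gram(d) for d in (10, 5, 8)]   # e.g., 6-mers, arrays, GO
K_comb = sum(K_list)                         # K_comb = sum_k K_k (Mercer)

clf = SVC(kernel="precomputed").fit(K_comb, y)
# A new x is classified via the row [K_comb(x, x_1), ..., K_comb(x, x_n)]:
print(clf.predict(K_comb[:1]))               # predict for x = x_1
```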
So addition of the individual kernel matrices automatically combines their feature information. Positive predictive values (the probability of a correct positive prediction) for the combined kernel K_comb reach approximately 90%.
Reference: Machine learning for regulatory analysis and
transcription factor target prediction in yeast (with D.
Holloway, C. DeLisi), Systems and Synthetic Biology, 2007.
Protein characterization
2. Application: protein sequences (J.P. Vert)
Jaakkola et al. (1998) developed feature space kernel methods for analyzing and classifying protein sequences.

Applications: classification of proteins into functional vs. structural classes, cellular localization of proteins, and knowledge of protein interactions.
We derive kernels (equivalently, appropriate feature maps!) by starting with choices of feature spaces F.
Choices of F: we map a protein p to Φ(p) ∈ F, where Φ(p) has information on:

• physical chemistry properties of the protein p
• strings of amino acids in p (see the DNA examples earlier)
• motifs (which standard functional portions appear in p?)
• similarity measures and local alignment measures with other standard proteins
Additional relevant protein features Φ(p) would be:

• sequence length
• time series of chemical properties of the amino acids in the sequence, e.g., hydrophilic properties, polarity
• transforms of these series, e.g., autocorrelation functions Σ_t a_t a_{t+k} (with a_t the running chemical parameter), or Fourier transforms
String map: a useful feature map
Consider a fixed string of length 6, e.g., S_k = PKTHDR. Define the k-th component x_k of the feature map Φ(p) = x(p) by

x_k = # occurrences of S_k in protein p

(spectrum kernel; Leslie et al., 2002), or

x_k = # occurrences of S_k in p with up to M mismatches

(mismatch kernel; Leslie et al., 2004),
or gapped string kernels (allow gaps, with weights that decay with the number of gaps; substring kernel, Lodhi et al., 2002).

For example, given the string

LIMASGCVGFSC

we have the spectrum of 3-mers

{LIM, IMA, MAS, ASG, SGC, GCV, CVG, VGF, GFS, FSC}.

The spectrum (string) kernel is

K(x, y) = Φ(x) · Φ(y) = Σ_k Φ_k(x) Φ_k(y),

where x, y are feature vectors.
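A minimal sketch of the spectrum kernel (toy sequences; counts shared k-mers):

```python
from collections import Counter

def spectrum(seq, k=3):
    """Phi(seq): counts of every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=3):
    """K(s, t) = Phi(s) . Phi(t) = sum over k-mers of count products."""
    ps, pt = spectrum(s, k), spectrum(t, k)
    return sum(ps[kmer] * pt[kmer] for kmer in ps)

print(spectrum_kernel("LIMASGCVGFSC", "MASGCV"))  # shared 3-mer counts
```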
General observations about kernel methods:

(1) The above examples illustrate an advantage of feature space methods: we can summarize amorphous-seeming information about an object g in a feature vector Φ(g).
(2) Beyond that, a very important advantage is the kernel trick: we summarize all the information about our sample data {x_j} in a kernel matrix K_ij = K(x_i, x_j). This allows representation of very high dimensional data Φ(x) in a matrix K whose size equals the number of examples (sometimes much smaller than the dimension). The matrix K is all we need to find the a_i in the discriminant function

f(x) = Σ_i a_i K(x, x_i) + b.
Another approach to forming kernels: similarity kernels. Start with a known collection (dictionary) of sequences

D = (x_1, …, x_n).

Define a similarity measure s(x, y), and define a feature vector by similarities to the objects in D:

Φ(x) = (s(x, x_1), …, s(x, x_n)).
• Pairwise kernels (Liao, Noble, 2003): s = a standard distance between strings.

• Motif kernels (Logan et al., 2001): s(x, x_j) = a distance measure between the string x and a motif (standard signature sequence) x_j.

A sketch of this construction appears below.
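This 'empirical feature map' is easy to prototype; a minimal sketch (the Hamming-style similarity s is a made-up stand-in for an alignment or motif score):

```python
import numpy as np

def similarity_feature_map(x, dictionary, s):
    """Phi(x) = (s(x, x_1), ..., s(x, x_n)) against a dictionary."""
    return np.array([s(x, xj) for xj in dictionary])

def s(a, b):
    """Toy similarity: fraction of matching positions."""
    L = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(L)) / max(len(a), len(b))

D = ["ACGGAT", "ACGTAT", "TTTTTT"]
phi = lambda x: similarity_feature_map(x, D, s)
K = lambda x, y: phi(x) @ phi(y)   # induced kernel K(x, y) = Phi(x).Phi(y)
print(K("ACGGAA", "ACGGAT"))
```

Whatever s is, the induced K(x, y) = Φ(x) · Φ(y) is automatically positive definite, since it is a dot product of explicit feature vectors.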
3. Jaakkola's Fisher score map

Jaakkola et al. (1998) studied HMM models for protein sequences and combined them with kernel methods.
1. Form a parametric family of probabilistic models P_θ, e.g., HMM models of a family of protein sequences, with θ ∈ Θ ⊆ R^m a family of parameters.

2. Find an estimate θ_0 (e.g., by Baum-Welch / maximum likelihood from some training set).

3. Form the feature map (Fisher score vector) on a sequence vector x:
Φ_0(x) = ∇_θ ln P_θ(x) |_{θ=θ_0}.

4. In the feature space, define the inner product (kernel K_0) by

K_0(x, y) = Φ_0(x) · (I_0^{−1} Φ_0(y)),

where

I_0 = E_{θ_0}[Φ_0(x) Φ_0(x)^T]

(expectation over x under θ_0) is the Fisher information matrix.
Advantages of the Fisher kernel:

• the Fisher score shows how strongly the probability P_θ(x) depends on each parameter θ_i in θ;
• the Fisher score Φ_θ(x) can be computed explicitly, e.g., for an HMM;
• different models θ_i can be trained and their kernels K_i combined.

A toy sketch of the construction follows.
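Here is a toy sketch of steps 3 and 4 for an i.i.d. multinomial sequence model (not the HMM of the text; the simplex constraint on θ is ignored, and I_0 is replaced by its empirical estimate):

```python
import numpy as np

ALPHABET = "ACGT"

def fisher_score(seq, theta):
    """Phi_0(seq) = grad_theta ln P_theta(seq), where
    P_theta(seq) = prod_t theta[seq_t]; the gradient is n_a / theta_a."""
    counts = np.array([seq.count(a) for a in ALPHABET], dtype=float)
    return counts / theta

theta0 = np.array([0.3, 0.2, 0.2, 0.3])          # estimated theta_0
train = ["ACGTAC", "AACGTT", "CCGTAA", "ACGGGT"]
S = np.array([fisher_score(x, theta0) for x in train])

I0 = (S.T @ S) / len(train)                      # empirical Fisher info
I0_inv = np.linalg.pinv(I0)

def fisher_kernel(x, y):
    """K_0(x, y) = Phi_0(x) . (I_0^{-1} Phi_0(y))."""
    return fisher_score(x, theta0) @ I0_inv @ fisher_score(y, theta0)

print(fisher_kernel("ACGTAC", "AACGTT"))
```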
Results for correct classification of proteins in the G-protein family, as a subset of the SCOP (nucleotide triphosphate hydrolases) superfamily: [results figure omitted]
4. Finding kernels: how do we decide?

We can find kernels by:

• Finding a feature map Φ into F_1 which separates positive and negative examples well. Then

K(x, y) = Φ(x) · Φ(y).
• Defining the kernel as a "similarity measure" K(x, y) which is large when x and y are "similar", given by a positive definite function K(x, y) with K(x, x) = 1 for all x.

Rationale: note that here we require, for all x in the feature space F_1,

|Φ(x)| = √K(x, x) = 1.
So

K(x, y) = Φ(x) · Φ(y) = |Φ(x)| |Φ(y)| cos θ = cos θ,   (2)

where θ ≡ the angle between Φ(x) and Φ(y).
So if x and y are similar (by the desired criterion), then K(x, y) is large and, by (2), θ is small, i.e., Φ(x) is close to Φ(y). Thus similar x and y are close in the new feature space F_1, and different x and y are far apart, i.e., they can be separated.
5. Finding translation initiation sites (Zien, Rätsch, Mika, Schölkopf, et al., 2000)

A translation initiation site (TIS) is a DNA location where coding for a protein starts (i.e., the beginning of a gene). It is usually determined by the codon ATG.

Question: how do we determine whether a particular ATG is the start codon of a TIS?
Strategy: given a potential start codon ATG at location i in the genome:

1. Start by looking at 200 nucleotides (nt) around the current ATG:

ACTGATGTG…AC [ATG] TAG…ATGCACC   (1)

(the bracketed ATG is the center).
2. Use the unary bit encoding of nucleotides:

A = 10000;  C = 01000;  G = 00100;  T = 00010;  unknown = 00001.

3. Concatenate the unary encodings, replacing each nt in (1) by its unary code:

Φ(i) = 10000 01000 00010 00100 … ∈ F
         A     C     T     G

This becomes the feature vector.
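A minimal sketch of this encoding (the window string is made up):

```python
import numpy as np

CODE = {"A": [1, 0, 0, 0, 0], "C": [0, 1, 0, 0, 0],
        "G": [0, 0, 1, 0, 0], "T": [0, 0, 0, 1, 0]}
UNKNOWN = [0, 0, 0, 0, 1]

def encode_window(window):
    """Concatenate the unary (one-hot) codes of each nucleotide."""
    bits = []
    for nt in window:
        bits.extend(CODE.get(nt, UNKNOWN))
    return np.array(bits)

print(encode_window("ACTG"))
# [1 0 0 0 0  0 1 0 0 0  0 0 0 1 0  0 0 1 0 0]
```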
What kernel should we use in this feature space? Try a polynomial kernel:

K(x, y) = (x · y)^m = (Σ_i x_i y_i)^m
        = (Σ_{i_1} x_{i_1} y_{i_1}) (Σ_{i_2} x_{i_2} y_{i_2}) ⋯ (Σ_{i_m} x_{i_m} y_{i_m})
        = Σ_{i_1, …, i_m} x_{i_1} y_{i_1} x_{i_2} y_{i_2} ⋯ x_{i_m} y_{i_m},

with m fixed and x, y ∈ F.

Note:
• If m = 1, then

K(x, y) = Σ_i x_i y_i = the number of common nucleotides (NT) in the strings x and y.

• If m = 2, then

K(x, y) = Σ_{i,j} x_i y_i x_j y_j = the total number of pairs of common NT in x and y (times 2).

• Generally, K(x, y) = # of m-tuples of common NT in x and y (times m!) for fixed m.
Note: this is a good similarity measure (see the previous section).
6. Better kernel: local matches

Now define

M_k(x, y) = 1 if x_k = y_k, and 0 otherwise.

Define the window kernel W for fixed k:

W_r(x, y) = [ Σ_{i=−k}^{k} a_i M_{r+i}(x, y) ]^{m+1}
          = [weighted # of matches in the window (r−k, r+k)]^{m+1}.

Usefulness: this measures correlations in a window of length 2k+1 centered at r, with less noise.
Now add up the weighted window kernels over the center positions r:

K_new(x, y) = Σ_{r=1}^{N} W_r(x, y).
Recognition error rates for TIS recognition (Zien, Rätsch, et al., 2000):

Method                          Error rate
Neural network                  15.4%
Linear kernel K with m = 1      13.2%
K_new                           11.9%
7. General kernel construction

Many string algorithms in computational biology can lead to kernels, as long as they give similarity scores S(x, y) for sequences x and y which translate to a positive definite kernel K(x, y).

1. The Smith-Waterman score (a pairwise alignment measure, kernelized in Gorodkin, 2001) gives a similarity kernel K(x, y) for multiple alignments (separation of strings with a hyperplane in string space).
2. Kernelization of other algorithms can be done similarly:

• kernel ICA (independent component analysis) algorithms
• kernel PCA (principal component analysis) algorithms
• kernel logistic regression methods
For other examples of string kernel methods in computational biology, see the talks of J.P. Vert:
http://cg.ensmp.fr/~vert/talks/060921icgi/icgi.pdf