1. SVM example: Computational Biology

Assume a fixed species (e.g. baker's yeast, S. cerevisiae) has genome V (its collection of genes). Typically a transcription factor (TF) t binds to the promoter (upstream) DNA near a gene g and initiates transcription. In this case we say g is a target of t.

Question: given a fixed TF t, for which genes g ∈ V are its targets? This is chemically hard to solve.

[Fig. 1: Left: DNA binding of GCM; right: binding of Fur (C. Ehmke, E. Pohl, EMBL, 2005)]

Try machine learning: consider a training data set (genes known to be targets or non-targets of t)

D_0 = {(g_i, y_i)}_{i=1}^n,

where g_i is a sample gene and y_i = 1 if g_i is a target, y_i = -1 otherwise. For all genes g define the function

f_0(g) = y = 1 if g is a target, -1 otherwise.

How can we learn the general function f_0: V → {-1, 1} from the examples in D_0?

Start by representing g as a feature vector x = [x_1, x_2, …, x_d]^T with 'useful' information about g and its promoter region. What is useful here? Consider

ACGGTCTGGT...CGT = promoter DNA sequence of g = upstream region.

This is where the TF typically binds; it has ~1000 bases.

2. Feature maps

One useful choice of feature vector: consider an ordered list of all possible strings of length 6 (all sequences of 6 base pairs):

string 1: AAAAAA
string 2: AAAAAC
string 3: AAAAAG
string 4: AAAAAT
string 5: AAAACA
⋮

Given gene g, choose the feature vector x = [x_1, …, x_d], where

x_1 = # appearances of string 1 in the promoter of g,
x_2 = # appearances of string 2 in the promoter of g,

etc. Consider the feature map Φ which takes g to x: Φ(g) = x. The number of possible strings of length 6 is 4^6 = 4,096, so d = 4,096 and

x = [x_1, …, x_d]^T ∈ ℝ^4096 ≡ F.

Thus let F = set of all possible x = feature space (a vector space). Henceforth replace g by its feature vector x = x(g); to classify g we classify x. Using x = x(g), the goal is to find

f(x) = y = 1 if g is a target, -1 if g is not.

Thus f maps a vector x of string counts in F to a yes or no.
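As a concrete illustration, the 6-mer counting map Φ can be sketched in a few lines of Python. This is not part of the original notes; the function name and example promoter are my own.

```python
from itertools import product

def kmer_feature_vector(promoter, k=6):
    """Feature map Phi: map a promoter DNA string to its vector of k-mer counts."""
    alphabet = "ACGT"
    # Enumerate all 4^k k-mers in the fixed order AAAAAA, AAAAAC, ... (as in the notes)
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    x = [0] * len(index)
    # Count overlapping occurrences of each k-mer in the promoter
    for start in range(len(promoter) - k + 1):
        window = promoter[start:start + k]
        if window in index:   # windows containing unknown bases are skipped
            x[index[window]] += 1
    return x

x = kmer_feature_vector("ACGGTACGGTACGGT")
print(len(x), sum(x), max(x))   # → 4096 10 2
```

A 15-base toy promoter yields 10 overlapping windows, so the counts sum to 10; a real upstream region of ~1000 bases would give a sparse vector in ℝ^4096.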
Replace g_i by x_i = x(g_i):

D_0 = {(g_i, y_i)}_i → D = {(x_i, y_i)}.

Thus, given the examples f(x_i) = y_i for the feature vectors x_i in our sample, we want to generalize and find f(x) for all x. With the data set D, can we find the right function f: F → {±1} which generalizes the above examples, so that f(x) = y for all feature vectors? Easier: find an f: F → ℝ, where f(x) > 0 if y = 1 and f(x) < 0 if y = -1.

The resulting predictive accuracy depends on the number of features used, i.e., on what the components x_i of x mean. For example, x_249 can be how many times ACGGAT appears; x_4598 can be how many times ACG _ _ _ GAT appears (i.e., any 3 letters allowed to go in the middle). Generally we choose the most significant features, the ones most helpful in discrimination. [For TF YIR018W: figure of accuracy in prediction of targets vs. number of features.]

Nonlinear kernels

SVM: Nonlinear Feature Maps and Kernels
http://www.youtube.com/watch?v=3liCbRZPrZA

1. General SVM: when K(x, y) is not an ordinary dot product

Recall: in yeast, V = genome (all genes). Given a fixed transcription factor t, we want to determine which genes g ∈ V bind to t. We have a feature map x: V → F = ℝ^d, with F the feature space and x(g) = x the feature vector for gene g. For each g define

y(g) = 1 if g binds, -1 otherwise.

We want a map f(x) which classifies genes. That is, for g ∈ V with feature vector x = Φ(g) we want

f(x) > 0 if y(g) = 1,   f(x) ≤ 0 if y(g) = -1.

We have examples {g_i}_{i=1}^n ⊆ V of genes with known binding, together with y(g_i). Define x_i = x(g_i) to be the feature vectors of the examples. The SVM provides f of the form

f(x) = w · x + b;

f(x) > 0 yields the conclusion y = 1 (binding gene), and otherwise y = -1. Thus we have linear separation of points in F. What about nonlinear separations?
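Before moving to nonlinear separations, the linear rule f(x) = w · x + b can be sketched on toy data. Note the notes obtain w and b by quadratic programming; the trainer below is only a stand-in (subgradient descent on the hinge loss), and all data and names are illustrative.

```python
def f(x, w, b):
    """Decision function f(x) = w . x + b; its sign gives the predicted class y."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear(X, y, lam=0.01, lr=0.1, epochs=200):
    """Toy trainer: subgradient descent on hinge loss with an |w|^2 penalty.
    (Sketch only; the notes' SVM uses quadratic programming instead.)"""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * f(xi, w, b) < 1:            # margin violated: move w toward yi * xi
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                               # margin satisfied: only shrink w
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

# Toy stand-ins for feature vectors of target (+1) and non-target (-1) genes
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w, b = train_linear(X, y)
print([1 if f(xi, w, b) > 0 else -1 for xi in X])   # → [1, 1, -1, -1]
```

On linearly separable data like this the learned hyperplane classifies all examples correctly; the interesting case, taken up next, is data that no hyperplane separates.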
1. Replace the base space V by F (i.e., replace each gene by its feature vector). Thus we have a collection of examples {(x_i, y_i)}_{i=1}^n of feature vectors x_i for which the binding y_i is known. We desire a new (possibly nonlinear) function f(x) which is positive when x is the feature vector of a binding gene and negative otherwise.

2. With F now as the base space, define a new feature map Φ: F → F_1 (now possibly nonlinear, but continuous). Map the collection of examples into F_1; thus the new set of examples is

{z_i ≡ (Φ(x_i), y_i)}_{i=1}^n.

Induce a linear SVM in F_1 (the SVM algorithm above).

3. New decision rule: f_1(Φ(x)) ≡ w_1 · Φ(x) + b. If f_1(Φ(x)) > 0 we conclude y = 1 (the gene binds), and otherwise y = -1. The equivalent rule on the original F is f(x) = f_1(Φ(x)). This allows arbitrary nonlinear separating surfaces on F.

Kernel trick

2. The kernel trick

Equivalently to the above: assume w_1 = Φ(w) for some w. The new decision rule is

f(x) = w_1 · Φ(x) + b = Φ(w) · Φ(x) + b.

Define the kernel function on F by K(x, y) = Φ(x) · Φ(y) (with · the ordinary dot product, as before). Back in F, we can show K(x, y) is a Mercer kernel:

(a) K(x, y) = K(y, x);
(b) K(x, y) is positive definite: indeed, given any set {x_i}_{i=1}^n, K(x_i, x_j) = Φ(x_i) · Φ(x_j) = u_i · u_j with u_i = Φ(x_i), and we already know the ordinary dot product gives a positive definite kernel;
(c) K is continuous because Φ is continuous.

Like any kernel function, K(x, y) satisfies certain properties of an inner product, and so can be thought of as a new dot product on F. Thus f(x) = K(w, x) + b. With the redefined dot product w · x ≡ K(w, x), training the SVM is identical to before (we have already developed the algorithm); we just replace the old dot product by the new one.

Conclusion: introducing nonlinear separators for the SVM by replacing x ∈ F with a nonlinear function Φ(x) is exactly equivalent to replacing the dot product w · x by K(w, x), with K a Mercer kernel!
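To make the kernel trick concrete: for 2-D inputs, the explicit quadratic feature map Φ(x) = (x_1², √2 x_1 x_2, x_2²) satisfies Φ(x) · Φ(y) = (x · y)², so the kernel can be evaluated in the original space without ever forming Φ. A quick numerical check (my own illustration, not from the notes):

```python
import math

def phi(x):
    """Explicit quadratic feature map F -> F_1 for 2-D inputs."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = [1.0, 2.0], [3.0, 0.5]
lhs = dot(phi(x), phi(y))        # Phi(x) . Phi(y), computed in the new space F_1
rhs = dot(x, y) ** 2             # K(x, y) = (x . y)^2, computed in the original space F
print(abs(lhs - rhs) < 1e-9)     # → True
```

The right-hand side never touches F_1, which is the whole point: for high-dimensional (or infinite-dimensional) Φ, only K(x, y) is ever computed.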
This is equivalent to replacing the standard linear kernel K(w, x) = w · x by a general nonlinear kernel, e.g., the Gaussian kernel

K(w, x) = e^{-|w-x|²}.

Recall the advantages: the calculation of w and b involves linear algebra on the matrix K_ij = K(x_i, x_j).

3. Examples

Ex 1: Gaussian kernel

K_σ(x, y) = e^{-|x-y|²/(2σ²)}   [one can show this is a positive definite Mercer kernel].

SVM: from (4) above we have

f(x) = Σ_j a_j K(x, x_j) + b = Σ_j a_j e^{-|x-x_j|²/(2σ²)} + b,

where the examples x_j in F have known classifications y_j, and a_j, b are obtained by quadratic programming. What kind of classifier is this? It depends on σ (see the Vert movies). Note: Movie 1 varies σ in the Gaussian (σ = ∞ corresponds to a linear SVM); Movie 2 then varies the margin |w_1| (in the Gaussian feature space F_2) as determined by changing λ, or equivalently C = 1/(2λn).

4. Software available

Software which implements the quadratic programming algorithm above includes:
• SVMLight: http://svmlight.joachims.org
• SVMTorch: http://www.idiap.ch/learning/SVMTorch.html
• LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm

A Matlab package which implements most of these is Spider: http://www.kyb.mpg.de/bs/people/spider/whatisit.html

Computational Biology Applications

References:
T. Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science, 1999.
S. Ramaswamy et al., Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures. PNAS, 2001.
B. Schölkopf, K. Tsuda, and J.-P. Vert, Kernel Methods in Computational Biology. MIT Press, 2004.
J.-P. Vert, http://cg.ensmp.fr/~vert/talks/060921icgi/icgi.pdf and http://cg.ensmp.fr/%7Evert/svn/bibli/html/biosvm.html

Transcription factor binding

1. Matching genes and TFs: for a yeast gene g ∈ V, the feature vector is x = Φ(g).
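A minimal sketch of the Gaussian kernel K_σ and a small kernel matrix K_ij (toy points and names of my own choosing):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """K_sigma(x, y) = exp(-|x - y|^2 / (2 sigma^2)), a Mercer kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Kernel matrix K_ij = K(x_i, x_j) for three toy feature vectors;
# diagonal entries are 1, and entries decay with distance between points.
pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = [[gaussian_kernel(p, q) for q in pts] for p in pts]
print([round(v, 3) for v in K[0]])   # → [1.0, 0.607, 0.135]
```

Increasing sigma flattens the kernel toward a constant on any bounded region, consistent with the remark that σ = ∞ behaves like the linear case.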
Examples: a data set D = {(x_i, y_i)}_{i=1}^n of known examples of feature vectors x_i (together with bindings y_i = ±1) is used to form the kernel matrix K_ij = K(x_i, x_j), where

K(x, y) = x · y is the linear kernel,
K(x, y) = e^{-‖x-y‖²} is the Gaussian kernel,
K(x, y) = (x · y + 1)^k is the polynomial kernel.

One can show all of these are Mercer kernels. Form the discriminant function f(x) = K(w, x) + b, with w, b determined by the quadratic programming algorithm using the matrix K_ij formed from the data set D. Then f(x) > 0 means the corresponding gene g with feature vector Φ(g) = x binds to the transcription factor.

What features are we interested in? Characterize g by its upstream region of about 1000 bases (reading from the 5′ end of the DNA). Interesting feature maps:

Φ_1(g) = x = vector of 6-string counts. Specifically: take all possible 6-strings (strings of 6 consecutive bases, e.g. ATGAAC), index them with i = 1 to 4^6 = 4096, and form the vector x with x_i = # appearances of the i-th 6-mer in the upstream region of the corresponding gene g (a large space!).

Or: Φ_2(g) = x = vector of microarray experiment results. Specifically,

x_i = 1 if gene g is expressed in the i-th microarray experiment, 0 otherwise.

Or: Φ_3(g) = vector of gene ontology appearances of g, i.e., x_i = 1 if the i-th ontology term applies to gene g.

Φ_4(g) = vector of melting temperatures of DNA along consecutive positions in the upstream region.

Each feature map Φ_k yields a different kernel K_k(x, y) and kernel matrix K_ij^(k).

Combination of features: integrate all the information into one large vector, i.e., concatenate the vectors Φ_k(x) into one:

Φ_comb(x) = (Φ_1(x), …, Φ_m(x)).

This is equivalent to taking a direct sum F of the corresponding feature spaces F_1, …, F_m. How do we define an inner product in the large feature space F?
In the obvious way for a concatenation of vectors:

Φ_comb(x) · Φ_comb(y) = Σ_k Φ_k(x) · Φ_k(y).

Thus the kernel corresponding to the feature map Φ_comb is given by

K_comb(x, y) = Φ_comb(x) · Φ_comb(y) = Σ_k Φ_k(x) · Φ_k(y) = Σ_k K_k(x, y).

Thus the SVM kernel K_comb which combines all the feature information is the sum of the individual kernels K_k! So adding the individual kernel matrices automatically combines their feature information. The positive predictive value (probability of a correct positive prediction) for the combined kernel K reaches approximately 90%.

Reference: Machine Learning for Regulatory Analysis and Transcription Factor Target Prediction in Yeast (with D. Holloway, C. DeLisi), Systems and Synthetic Biology, 2007.

Protein characterization

2. Application: protein sequences (J.-P. Vert)

Jaakkola et al. (1998) developed feature space kernel methods for analyzing and classifying protein sequences. Applications: classification of proteins into functional vs. structural classes, cellular localization of proteins, and knowledge of protein interactions. We derive kernels (equivalently, appropriate feature maps!) by starting with choices of feature spaces F.

Choices of F: we map a protein p into Φ(p) ∈ F, where Φ(p) has information on:
• physical chemistry properties of the protein p
• strings of amino acids in p (see the DNA examples earlier)
• motifs (which standard functional portions appear in p?)
• similarity measures and local alignment measures with other standard proteins

Additional relevant protein features Φ(p) would be:
• sequence length
• time series of chemical properties of the amino acids in the sequence, e.g., hydrophilic properties, polarity
• transforms of these series, e.g., autocorrelation functions Σ_t a_t a_{t-k} (with t a running parameter) and Fourier transforms

String map: a useful feature map. Consider a fixed string of length 6, e.g.
S_k = PKTHDR.

Define the k-th component x_k of the feature map Φ(p) = x(p) by

x_k = # occurrences of S_k in the protein p   (spectrum kernel, Leslie et al., 2002)

or

x_k = # occurrences of S_k in p with up to M mismatches   (mismatch kernel, Leslie et al., 2004)

or gapped string kernels (which allow gaps, with weights that decay with the number of gaps; substring kernel, Lodhi et al., 2002). For example, given the string LIMASGCVGFSC we have the spectrum of 3-mers

{LIM, IMA, MAS, ASG, SGC, GCV, CVG, VGF, GFS, FSC}

and the spectrum (string) kernel

K(x, y) = Φ(x) · Φ(y) = Σ_k Φ_k(x) Φ_k(y),

where x, y are feature vectors.

General observations about kernel methods:

(1) The above examples illustrate the advantage of feature space methods: we are able to summarize amorphous-seeming information about an object g in a feature vector Φ(g).

(2) After that, a very important advantage is the kernel trick: we summarize all information about our sample data {x_j} in a kernel matrix K_ij = K(x_i, x_j). This allows representation of very high-dimensional data Φ(x) in a matrix K whose size equals the number of examples (sometimes much smaller than the dimension). The matrix K is all we need to find the a_i in the discriminant function

f(x) = Σ_i a_i K(x, x_i) + b.

Another approach to forming kernels: similarity kernels. Start with a known collection (dictionary) of sequences D = (x_1, …, x_n). Define a similarity measure s(x, y), and define a feature vector by similarities to the objects in D:

Φ(x) = (s(x, x_1), …, s(x, x_n)).

• Known as pairwise kernels (Liao, Noble, 2003): s = a standard distance between strings
• Motif kernels (Logan et al., 2001): s(x, x_j) = a distance measure between the string and a motif (standard signature sequence) x_j

Jaakkola's feature map

3. Jaakkola's Fisher score map

Jaakkola et al. (1998) studied HMM models for protein sequences and combined them with kernel methods.
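The 3-mer spectrum map and kernel from the string-kernel discussion can be sketched directly; the sequence below is the example from the notes, while the helper names and the second test string are mine.

```python
from collections import Counter

def spectrum(seq, k=3):
    """Phi(seq): count of every k-mer occurring in seq (spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=3):
    """K(s, t) = Phi(s) . Phi(t): sum of products of shared k-mer counts."""
    ps, pt = spectrum(s, k), spectrum(t, k)
    return sum(count * pt[kmer] for kmer, count in ps.items())

s = "LIMASGCVGFSC"
print(sorted(spectrum(s)))            # the ten 3-mers LIM, IMA, ..., FSC
print(spectrum_kernel(s, s))          # → 10 (each of the 10 distinct 3-mers occurs once)
print(spectrum_kernel(s, "GVGFS"))    # → 2  (shares VGF and GFS with s)
```

Storing Φ as a Counter rather than a dense length-20³ vector is exactly the sparsity that makes string kernels practical.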
1. Form a parametric family of probabilistic models P_θ (e.g., HMM models of a family of protein sequences), with θ ∈ Θ ⊆ ℝ^m a family of parameters.

2. Find an estimate θ_0 (e.g., by Baum-Welch / maximum likelihood from some training set).

3. Form the feature map (Fisher score vector) on a sequence vector x:

Φ_0(x) = ∇_θ ln P_θ(x) |_{θ=θ_0}.

4. In feature space, define the inner product (kernel K_0) by

K_0(x, y) = Φ_0(x) · (I_0^{-1} Φ_0(y)),

where

I_0 = E_{θ_0}[Φ_0(x) Φ_0(x)^T]

(expectation over x assuming parameter θ_0) is the Fisher information matrix.

Advantages of the Fisher kernel:
• The Fisher score shows how strongly the probability P_θ(x) depends on each parameter θ_i in θ
• The Fisher score Φ_θ(x) can be computed explicitly, e.g., for an HMM
• Different models θ_i can be trained and their kernels K_i combined

[Results figure: correct classification of proteins in the G-protein family as a subset of the SCOP (nucleotide triphosphate hydrolases) superfamily.]

Finding kernels

4. Finding kernels: how do we decide?

We can find kernels by:

• Finding a feature map Φ into F_1 which separates positive and negative examples well; then K(x, y) = Φ(x) · Φ(y).

• Defining the kernel as a "similarity measure" K(x, y) which is large when x and y are "similar", given by a positive definite function K(x, y) with K(x, x) = 1 for all x.

Rationale: note that here we require, for all x in the feature space F_1,

|Φ(x)| = √K(x, x) = 1,

so

K(x, y) = Φ(x) · Φ(y) = |Φ(x)| |Φ(y)| cos θ = cos θ,   (2)

where θ ≡ the angle between Φ(x) and Φ(y). So if x and y are similar (by the desired criterion), then K(x, y) is large and, by (2), θ is small, i.e., Φ(x) is close to Φ(y). Thus similar x and y are close in the new feature space F_1, and different x and y are far apart, i.e., they can be separated.
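The rationale above, that a kernel normalized so K(x, x) = 1 is the cosine of the angle between feature vectors, can be checked directly. Here Φ is taken as the identity map for simplicity, so this reduces to plain cosine similarity (my own toy vectors):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalized_kernel(x, y):
    """K(x, y) / sqrt(K(x, x) K(y, y)): satisfies K(x, x) = 1 and equals
    cos(theta) for the angle theta between Phi(x) and Phi(y)."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

similar = normalized_kernel([1.0, 2.0, 3.0], [2.0, 4.0, 6.1])   # near-parallel vectors
different = normalized_kernel([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]) # orthogonal vectors
print(similar > 0.99, different)   # → True 0.0
```

Similar inputs score near 1 (small angle, nearby in F_1), dissimilar inputs score near 0, which is exactly the separability argument of equation (2).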
Finding translation initiation sites

5. Finding translation initiation sites (Zien, Rätsch, Mika, Schölkopf, et al., 2000)

A translation initiation site (TIS) is a DNA location where coding for a protein starts (i.e., the beginning of a gene). It is usually determined by the codon ATG. Question: how do we determine whether a particular ATG is the start codon of a TIS?

Strategy: given a potential start codon ATG at location i in the genome:

1. Start by looking at 200 nucleotides (nt) around the current ATG:

ACTGATGTG…AC ATG TAG…ATGCACC,   (1)

with the marked ATG at the center.

2. Use the unary bit encoding of nucleotides:

A = 10000; C = 01000; G = 00100; T = 00010; unknown = 00001.

3. Concatenate the unary encodings, replacing each nt in (1) by its unary code:

Φ(i) = 10000 01000 00010 00100 … ∈ F
        (A)   (C)   (T)   (G)

This becomes the feature vector.

What kernel should we use in this feature space? Try the polynomial kernel:

K(x, y) = (x · y)^m = (Σ_i x_i y_i)^m = Σ_{i_1} x_{i_1} y_{i_1} · Σ_{i_2} x_{i_2} y_{i_2} · … · Σ_{i_m} x_{i_m} y_{i_m} = Σ_{i_1,…,i_m} x_{i_1} y_{i_1} x_{i_2} y_{i_2} ⋯ x_{i_m} y_{i_m},

with m fixed and x, y ∈ F. Note:

• If m = 1, then K(x, y) = Σ_i x_i y_i = the number of common nt (nucleotides) in the strings x and y.
• If m = 2, then K(x, y) = Σ_{i,j} x_i y_i x_j y_j = the total number of pairs of common nt in x and y (times 2).
• Generally, K(x, y) = the number of m-tuples of common nt in x and y (times m!) for fixed m.

Note: this is a good similarity measure (see the previous section).

6. Better kernel: local matches

Now define

M_k(x, y) = 1 if x_k = y_k, 0 otherwise.

Define the window kernel W_r for fixed k:

W_r(x, y) = [ Σ_{i=-k}^{k} a_i M_{r+i}(x, y) ]^{m+1} = [weighted # of matches in the window (r-k, r+k)]^{m+1}.

Usefulness: this measures correlations in a window of length 2k+1 centered at r, with less noise. Now add up the weighted window scores over the center positions r:

K_new(x, y) = Σ_{r=1}^{N} W_r(x, y).
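A sketch of the unary encoding and the windowed kernel K_new, using uniform weights a_i = 1, window half-width k = 1, and exponent 2; the published locality-improved kernel tunes all of these, so every parameter choice here is illustrative only.

```python
def unary(seq):
    """Unary 5-bit encoding: A=10000, C=01000, G=00100, T=00010, unknown=00001."""
    code = {"A": "10000", "C": "01000", "G": "00100", "T": "00010"}
    return "".join(code.get(base, "00001") for base in seq)

def matches(s, t):
    """M_p(s, t) = 1 exactly when the nucleotides at position p agree."""
    return [1 if a == b else 0 for a, b in zip(s, t)]

def window_kernel(s, t, k=1, power=2):
    """K_new(s, t): sum over centers r of (matches in window (r-k, r+k))^power,
    with uniform weights a_i = 1 (a toy version of the windowed kernel)."""
    m = matches(s, t)
    total = 0
    for r in range(k, len(m) - k):
        window_score = sum(m[r + i] for i in range(-k, k + 1))
        total += window_score ** power
    return total

print(unary("ACG"))                        # → 100000100000100
print(window_kernel("ACGTGA", "ACGTTA"))   # → 26
```

Because matches are raised to a power inside each window before summing, runs of consecutive agreement count more than the same number of scattered matches, which is the locality idea behind the improvement over the plain polynomial kernel.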
Recognition error rates for TIS recognition (Zien, Rätsch, et al., 2000):

Neural network: 15.4%
Linear kernel K with m = 1: 13.2%
K_new: 11.9%

7. General kernel construction

Many string algorithms in computational biology can lead to kernels, as long as they give similarity scores S(x, y) for sequences x and y which translate to a positive definite kernel K(x, y).

1. The Smith-Waterman score (a pairwise alignment measure, kernelized in Gorodkin, 2001) gives a similarity kernel K(x, y) for multiple alignments (separation of strings with a hyperplane in string space).

2. Kernelization of other algorithms can be done similarly:
• kernel ICA (independent component analysis) algorithms
• kernel PCA (principal component analysis) algorithms
• kernel logistic regression methods

For other examples of string kernel methods in computational biology, see the talks of J.-P. Vert: http://cg.ensmp.fr/~vert/talks/060921icgi/icgi.pdf