The Challenge of Predicting Gene Function
Ross D. King
Department of Computer Science, University of Wales, Aberystwyth

Gene Function Prediction
The most important revelation from the sequenced genomes is that the functions of typically only 60-70% of the predicted genes are known with any confidence. The new science of functional genomics is dedicated to determining the function of the genes of unassigned function, and to further detailing the function of genes with purported function.

Data Mining Prediction
We have developed a method for predicting the functional class of gene products based on ILP/relational data mining. The idea is to learn a reliable predictive function from examples of genes whose products have known function, and then to apply this function to genes whose functional class is unknown. We call this approach Data Mining Prediction (DMP).

Predicting Gene Function in Yeast
We will demonstrate our approach using ORFs in yeast (Saccharomyces cerevisiae).
● Using the MIPS functional classification scheme
● For those ORFs whose function is currently unknown
● Using 5 types of data:
  1. Sequence statistics
  2. Homology (sequence similarity)
  3. Predicted secondary structure
  4. Expression (microarray)
  5. Phenotype
● We want to map from sequence to function class
[Diagram: Sequence 1, Sequence 2, Sequence 3 and Sequence 4 mapped to Function Class 1 and Function Class 2]

Classification Schemes 1
MIPS/GeneOntology
1,0,0,0 "METABOLISM"
2,0,0,0 "ENERGY"
3,0,0,0 "CELL CYCLE AND DNA PROCESSING"
4,0,0,0 "TRANSCRIPTION"
5,0,0,0 "PROTEIN SYNTHESIS"
6,0,0,0 "PROTEIN FATE (folding, modification, destination)"
8,0,0,0 "CELLULAR TRANSPORT AND TRANSPORT MECHANISMS"
10,0,0,0 "CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM"
11,0,0,0 "CELL RESCUE, DEFENSE AND VIRULENCE"
13,0,0,0 "REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT"
14,0,0,0 "CELL FATE"
29,0,0,0 "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS"
30,0,0,0 "CONTROL OF CELLULAR ORGANIZATION"
40,0,0,0 "SUBCELLULAR LOCALISATION"
62,0,0,0 "PROTEIN ACTIVITY REGULATION"
63,0,0,0 "PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT"
67,0,0,0 "TRANSPORT FACILITATION"
98,0,0,0 "CLASSIFICATION NOT YET CLEAR-CUT"
99,0,0,0 "UNCLASSIFIED PROTEINS"

Classification Schemes 2
Hierarchy of classes
1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"

Classification Schemes 3
Hierarchy of classes
1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
    1,1,1,0 "amino acid biosynthesis"
    1,1,4,0 "regulation of amino acid metabolism"
    1,1,7,0 "amino acid transport"
    1,1,10,0 "amino acid degradation (catabolism)"
    1,1,99,0 "other amino acid metabolism activities"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"
...
and ORFs may have multiple functions too!

Sequence Data
field             description                                    type
aa_rat_X          % of amino acid X in the protein               real
seq_len           length of the protein sequence                 int
aa_rat_pair_X_Y   % of the amino acids X and Y consecutively     real
mol_wt            molecular weight of the protein                int
theo_pI           theoretical pI (isoelectric point)             real
atomic_comp_X     atomic composition of X (C,H,N,O,S)            real
aliphatic_index   aliphatic index                                real
hydro             grand average of hydropathy                    real
strand            the DNA strand                                 'w' or 'c'
position          the number of exons (no. of start positions)   int
cai               codon adaptation index                         real
motifs            number of PROSITE motifs                       int
tmSpans           number of transmembrane spans                  int
chromosome        chromosome number                              1..16,mit
478 attributes in total

Homology data
YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
[Diagram: the sequence is searched with PSI-BLAST against the sequence database NRDB]
gene     organism           score
tfc      baker's yeast      0.0
sfc3     fission yeast      1.0e-18
wsv442   white spot virus   2.1
cg9463   fruit fly          2.9
f1l3     Arabidopsis        3.0
sfc3: keyword(membrane) length(358) dbref(prosite) dbref(embl)
We look up the associated information from SwissProt.

Predicted Secondary Structure Data
mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
We record the length and relative positions of the secondary structure elements. This is relational data.

Expression Data
• Microarray experiments to measure expression changes in yeast under a variety of conditions, including cell cycle, heat shock, diauxic shift.
• Short time series data, numerical-valued
          a0      a7      a14     a21
YBR166C   0.33    -0.17   0.04    -0.07
YOR357C   -0.64   -0.38   0.32    -0.29
YLR292C   -0.23   0.19    0.36    0.14
YGL112C   -0.69   -0.89   -0.74   -0.56
...
Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)

Phenotype Data
• Data from knockout gene growth experiments
• Many missing data
• 69 attributes x 1461 ORFs of known function
• 991 genes of unknown function
• Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)
              growth medium
deleted ORF   calcofluor white   sorbitol   benomyl   H2O2   ...
YAL001C       w                  n          n         w
YAL019W       n                  s          w         w
YAL021C       n                  n          n         n
YAL029C       n                  w          w         r
s = sensitive (less growth)
w = wild-type (no observable effect)
r = resistant (more growth)
n = no data

What are the Machine Learning Issues?
• Large volume of data
• Missing data
• Accurate results required
• Intelligible results required
• Class hierarchy
• Multiple labels
• Relational data

Relational vs Propositional
Propositional: single table, fixed number of columns/attributes
orf       time0   time7   time14
yal001c   0.34    0.52    0.48
yal002w   0.76    0.82    0.89
yal003w   0.77    0.46    0.78
yal004c   0.38    0.50    0.49
Relational: multiple tables, multiple values
orf       SwissProtID   e-val
yal001c   p03415        2e-4
yal001c   p08640        8e-58
yal002w   p32583        6e-52
yal002w   p08775        3e-42

SwissProtID   keyword
p03415        apoptosis
p03415        repeat
p03415        zinc
p08640        membrane

Data Mining Prediction (DMP)
[Flow diagram: the entire database is split into test data (1/3) and data for rule creation (2/3); the data for rule creation is split into training data (2/3) and validation data (1/3). Stages shown: PolyFARM; C4.5 rule generation on the training data, giving all rules; selection of the best rules using the validation data; measurement of rule accuracy on the test data, giving the results.]

Warmr
Warmr is an ILP algorithm developed by Dehaspe et al. It is an ILP version of the well-known Apriori data mining algorithm, designed to find frequent patterns in a Datalog database.

PolyFARM
• First-order association rule mining
• Finding all frequent first-order patterns in the data
• Distributed on a Beowulf cluster
• 47,034 homology patterns, f > 5%
• 19,628 structure patterns, f > 2%
[Clare & King PADL 2003]
A close homology to a short protein in E. coli:
hom(SPID, close) ^ sq_len(SPID, short) ^ classification(SPID, ecoli)
Contains alpha-coil-alpha with a high overall coil distribution:
struc(Pos1, a) ^ neighbour(Pos1, Pos2, c) ^ neighbour(Pos2, Pos3, a) ^ coil_dist(high)

Propositionalisation
Transforming relational data into boolean attributes
          patt1   patt2   patt3   patt4   ...   patt47034
YAL001C   0       1       0       0       ...   1
YAL002W   0       1       1       0       ...   1
YAL003W   1       0       0       1       ...   0
YAL004W   1       1       0       0       ...   1
YAL005C   0       0       0       0       ...   1
...

Dichotomic Search 1
As an alternative to the WARMR data-mining approach, we developed a frequent pattern finding method based on dichotomic search. This approach uses domain-specific logics as intermediates between propositional logic and predicate logic.

Dichotomic Search 2
Most existing algorithms traverse the search space in either a top-down or a bottom-up fashion. We propose a new approach based on dichotomic search, which explores the search space in both directions, allowing larger steps.
Dichotomic search combines completeness (w.r.t. concepts), non-redundancy, and flexibility.
Ferre, S. & King, R.D. (2005). Fundamenta Informaticae

C4.5
Open source decision tree algorithm
• propositional learning
• commonly used
• produces interpretable rules
• reliable
• fast
• accurate
Made modifications for:
• multiple labels
• hierarchical labels
[Clare & King Bioinformatics 2002]
[Figure: example decision tree with tests on aa_ratio_pair_p_y (> 0.232 / <= 0.232), strand (w / c) and aa_rat_a (<= 6.4 / > 6.4), and leaves metabolism, transcription, cell fate and transport]

Results
• Many rules from each data type
• Rules at each level of the hierarchy
• Some classes are much easier to predict than others (for example "protein synthesis" at 71-93%, "energy" at 20-47%)
• Good levels of accuracy on held-out test data
• Many predictions for ORFs of unknown function (some function at some level is predicted for 96% of the ORFs of unknown function)
• Some rules explainable by biology -> scientific knowledge discovery
Clare & King (2003) Bioinformatics suppl. 2, 42-49

Accuracy Table
           Level
Datatype   1    2    3    4    all
Seq        55   55   33   0    71
Struc      49   43   0    0    58
Hom        65   38   69   20   55
Expr       42   37   35   0    75
Phen       75   40   7    0    68

Expression Data Rule
If in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
and in the micro-array experiment (YPD stationary phase) the ORF expression is > 1.06
then the function of this ORF is "pheromone response, mating type determination, sex-specific proteins"
Accuracy on training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made

Structure Rule
If true: coil (of length 3) followed by alpha (10 <= length < 14)
and true: coil (of length 1 or 2) followed by alpha (10 <= length < 14)
and true: coil (of length 3) followed by alpha (3 <= length < 6)
and false: coil followed by beta followed by coil (c-b-c)
and false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
then the function of this ORF is "mitochondrial transport"
• 80% accurate on test data
• Most matching ORFs belong to the Mitochondrial Carrier Family
• These have 6 long transmembrane alpha-helices of about 20-30 amino acids
• Why do we notice alpha-helices of length 10-14?
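A rule of the kind shown on the Expression Data Rule slide can be read as a conjunction of threshold tests on an ORF's expression values, scored by how many of the ORFs it matches carry the predicted class. The following is only a minimal sketch of that reading: the thresholds and class name come from the slide, but the record type, field names, ORF names and expression values are made up for illustration, and it is not the DMP implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# Class predicted by the rule on the slide.
PREDICTED_CLASS = "pheromone response, mating type determination, sex-specific proteins"


@dataclass
class OrfExpression:
    """Hypothetical record: one expression value per micro-array experiment."""
    name: str
    sorbitol_incubation: float
    nitrogen_depletion: float
    ypd_stationary_phase: float


def rule_fires(orf: OrfExpression) -> bool:
    """Conjunction of the three threshold tests quoted on the slide."""
    return (
        orf.sorbitol_incubation > -0.25
        and orf.nitrogen_depletion <= -1.29
        and orf.ypd_stationary_phase > 1.06
    )


def rule_accuracy(
    orfs: List[OrfExpression], known_classes: Dict[str, Set[str]]
) -> Optional[Tuple[int, int]]:
    """Return (correct, matched) over ORFs of known function that the rule matches.

    An ORF may have several classes, so a match counts as correct when the
    predicted class is among its known classes.
    """
    matched = [o for o in orfs if rule_fires(o) and o.name in known_classes]
    if not matched:
        return None
    correct = sum(PREDICTED_CLASS in known_classes[o.name] for o in matched)
    return correct, len(matched)


# Made-up values, not data from the talk.
orfs = [
    OrfExpression("orf_a", 0.10, -1.50, 1.20),   # all three tests pass
    OrfExpression("orf_b", 0.30, -2.00, 1.50),   # all three tests pass
    OrfExpression("orf_c", -0.40, -1.50, 1.20),  # fails the first test
]
known = {
    "orf_a": {PREDICTED_CLASS},
    "orf_b": {"energy"},
    "orf_c": {PREDICTED_CLASS},
}
print(rule_accuracy(orfs, known))  # (1, 2)
```

Reporting the pair (correct, matched) rather than a single ratio mirrors the way the slide quotes accuracy as 11/12 and 3/4.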
Alignment
YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251
YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241
YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310
YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271
YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250
YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246
YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261
YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239
YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300
YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242
YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302
YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255

YJL133W SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310
YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300
YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364
YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325
YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310
YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303
YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312
YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289
YMR166C

Alignment
YJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251
YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241
YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310
YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271
YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250
YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246
YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261
YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239
YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300
YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242
YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302
YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255

YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310
YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300
YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364
YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325
YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310
YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303
YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312
YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289
YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360
YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293
YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359
YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305

Homology rule
If the ORF is not weakly homologous to a protein in klebsiella
and is strongly homologous to a protein in desulfurococcales
and is strongly homologous to a short protein in cyprinidae
then the function of this ORF is "Protein fate (folding, modification, destination)"
• This rule is 100% accurate on test data
• Almost all matching ORFs are from the 20S proteasome subunit for degradation of proteins
• These subunits exist in archaea and eukaryotes, but only in one specific branch of bacteria (actinomycetes).
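The homology rule above mixes patterns that must be present with a pattern that must be absent. One way to picture this is to turn relational homology facts into boolean attributes per ORF (propositionalisation) and then test the conjunction. The sketch below assumes a hypothetical `Hit` record, hypothetical attribute names and invented hits; the taxa and the shape of the rule are taken from the slide, and nothing here reproduces PolyFARM or C4.5.

```python
from typing import Dict, List, NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical relational fact: one homology hit for an ORF."""
    orf: str
    strength: str   # "weak" or "strong"
    taxon: str      # taxonomic group of the matched protein
    short: bool     # whether the matched protein is short


def has_hit(hits: List[Hit], orf: str, strength: str, taxon: str,
            short: Optional[bool] = None) -> bool:
    """True if the ORF has at least one hit matching the pattern."""
    return any(
        h.orf == orf and h.strength == strength and h.taxon == taxon
        and (short is None or h.short == short)
        for h in hits
    )


def propositionalise(hits: List[Hit], orf: str) -> Dict[str, bool]:
    """Turn relational facts into a fixed set of boolean attributes."""
    return {
        "weak_klebsiella": has_hit(hits, orf, "weak", "klebsiella"),
        "strong_desulfurococcales": has_hit(hits, orf, "strong", "desulfurococcales"),
        "strong_short_cyprinidae": has_hit(hits, orf, "strong", "cyprinidae", short=True),
    }


def homology_rule(attrs: Dict[str, bool]) -> bool:
    """One absent pattern and two present patterns, as in the rule on the slide."""
    return (
        not attrs["weak_klebsiella"]
        and attrs["strong_desulfurococcales"]
        and attrs["strong_short_cyprinidae"]
    )


# Invented hits for two made-up ORFs.
hits = [
    Hit("orf_a", "strong", "desulfurococcales", False),
    Hit("orf_a", "strong", "cyprinidae", True),
    Hit("orf_b", "strong", "desulfurococcales", False),
    Hit("orf_b", "strong", "cyprinidae", True),
    Hit("orf_b", "weak", "klebsiella", False),
]
print(homology_rule(propositionalise(hits, "orf_a")))  # True
print(homology_rule(propositionalise(hits, "orf_b")))  # False
```

Once each pattern is a boolean column, a propositional learner can treat presence and absence of a pattern symmetrically, which is what allows negated conditions such as "not weakly homologous" to appear in a rule.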
Application of DMP to Bacterial Genomes
Successful for both M. tuberculosis and E. coli.
Of the ORFs with no assigned function, >40% were predicted to have a function at one or more levels of the class hierarchy.
It was found that many of the predictive rules were more general than is possible using sequence homology.
References
King et al. (2000) KDD 2000
King et al. (2000) Yeast (Comparative and Functional Genomics)
King et al. (2001) Bioinformatics

Example Rule (level 2 E. coli)
If the ORF is not predicted to have a b-strand of length 3
and a homologous protein from class Chytridiomycetes was found
then its functional class is "Cell processes, Transport/binding proteins"
12/13 (86%) correct on Test Set - probability of this result occurring by chance is estimated at 4x10^-7.
24 ORFs of unknown function are predicted by the rule.
16 ORFs now with putative or confirmed function - 93.8% accurate predictions

Experimental Confirmation
The original bacterial ORF predictions were made over three years ago.
In the intervening time many more ORFs have been sequenced, making traditional homology-based prediction methods more accurate and sensitive, and the function of some ORFs has been determined by wet biology.
The E. coli genome has been re-annotated by Monica Riley's group.

"Wet" Biology Confirmation
A number of predictions have been confirmed or falsified by new "wet" experimental data.
This new data is biased towards hard classes. Despite this the results are still good:
- Level 2: 23 predictions - 47.8% accuracy
- Level 3: 23 predictions - 43.4% accuracy
This is very much better than random as there are many classes.

Confirmation of "Wet" Predictions
ORF    Rule  Predicted Class                 Confirmed Function                                                   Result
b0805  8     Cell envelope                   Outer membrane protein                                               C
b1519  15    Degradation of small molecules  Trans-aconitate methyltransferase                                    C
b1533  43    Transport/binding proteins      Cysteine pathway metabolite transport                                C
b1981  42    Transport/binding proteins      Shikimate and dehydroshikimate transport protein                     C
b1981  56    Transport/binding proteins      Shikimate and dehydroshikimate transport protein                     C
b2210  15    Degradation of small molecules  Malate:quinone oxidoreductase                                        C
b2392  43a   Transport/binding proteins      High-affinity manganese transporter                                  C
b2392  43b   Transport/binding proteins      High-affinity manganese transporter                                  C
b2392  54    Transport/binding proteins      High-affinity manganese transporter                                  C
b2924  45    Transport/binding proteins      Component of the MscS mechanosensitive channel - "new gene family"   C
b3839  43    Transport/binding proteins      Essential component of translocase                                   C
b0103  42    Transport/binding proteins      dephospho-CoA kinase                                                 W
b0103  41    Transport/binding proteins      dephospho-CoA kinase                                                 W
b0103  43    Transport/binding proteins      dephospho-CoA kinase                                                 W
b1822  15    Degradation of small molecules  23S rRNA m1G745 methyltransferase                                    W
b2530  35    Global regulatory functions     cysteine desulfurase                                                 W
b2392  14    Degradation of small molecules  High-affinity manganese transporter                                  W
b2889  50    Energy metabolism carbon        Isopentenyl diphosphate isomerase                                    W
b3222  54    Transport/binding proteins      ManNAc kinase                                                        W
b3223  39    Ribosome constituents           ManNAc epimerase                                                     W
b3337  28    Laterally acquired elements     regulatory or redox component                                        W
b3338  39    Ribosome constituents           Periplasmic endochitinase                                            W
b3569  32    Laterally acquired elements     transcriptional regulator of xylose utilization                      W
b3955  8     Cell envelope                   Required for invasion of brain microvascular endothelial cells       EF
b3955  18    Energy metabolism carbon        Required for invasion of brain microvascular endothelial cells       EA
b3955  20    Energy metabolism carbon        Required for invasion of brain microvascular endothelial cells       EA

Extension to Arabidopsis Genome
Collaborative project with the Institute of Grassland and Environmental Research and the University of Nottingham.
Large increase in data: 6,000 (yeast) -> 25,000 ORFs.
Large amount of micro-array data from the Nottingham Arabidopsis Stock Centre.
The increase in data (100s of MBs) is a challenge to our machine learning algorithms.
Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136

Results
Accuracy comparable to yeast and bacteria.
A large fraction of the genes of currently unknown function are predicted.
Some rules could be interpreted in terms of known biology.
Clare, A., Karwath, A., Ougham, H. and King, R.D. (2006) Functional Bioinformatics for Arabidopsis thaliana. Bioinformatics 22: 1130-1136

Gibberellin Biosynthesis Prediction
Gibberellin is an important plant hormone.
Chosen because of interesting phenotypes, often extreme size.
Insertion of a promoter to overproduce the gene product.
Result:
- 2 days earlier flowering
- Average leaf number and weight increased at 21 days.
This phenotype is consistent with the prediction.
Leaf number increases more rapidly in the mutant (yellow bars) than in wildtype Landsberg erecta (blue bars)
[Figure: bar chart "Leaf number", number of leaves against days after sowing (21, 24, 28, 31, 34)]

Paclobutrazol (P) (inhibitor of gibberellin) abolishes the difference between mutant (M) and wildtype (L)
C = control
[Figure: bar chart "Average Leaf number at 21 days, Expt 4", by treatment: LC, MC, LP, MP]

Availability
All predictions available at http://www.genepredictions.org
All rules and data available at http://www.aber.ac.uk/compsci/Research/bio/dss/

ILP 2005 Challenge 1
Yeast function prediction data used as a community challenge: http://www.protein-logic.com/
The intention of the challenge was to provide a real-world data set to test how far we have progressed in the field of ILP and multi-relational data mining. The questions we wanted to answer were:
Are the tools up to the job?
Do they scale?
Do they handle noisy, sparse and complex data?

ILP 2005 Challenge 2
A. J. Knobbe, E. K. Y. Ho, R. Malik: ILP Challenge 2005: The Safarii MRDM environment.
C. Perlich: Approaching the ILP 2005 challenge: Class-Conditional Bayesian Propositionalization for Genetic Classification.
J. Struyf, C. Vens, T. Croonenborghs, S. Dzeroski, H. Blockeel: Applying Predictive Clustering Trees to the Inductive Logic Programming 2005 Challenge Data.
F. Riguzzi: A Simple Approach to a Multi-Label Classification Problem.

Propositional Approach
Zafer Barutcuoglu, Robert E. Schapire and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics (in press)
Hierarchy of SVMs.
Uses a Bayesian net to combine predictions.

Conclusions
• Data mining and machine learning are powerful tools for functional genomics.
• The DMP method can be successfully applied to different genomes (bacterial, yeast, Arabidopsis) to predict gene functional class.
• Micro-array data is a useful component in DMP.
• Biological insight can be extracted from DMP rules.
• The structure of gene prediction problems makes them an exciting test bed for machine learning methods.

Acknowledgements
Amanda Clare, Aberystwyth
Andreas Karwath, Freiburg (Aberystwyth)
Luc Dehaspe, PharmaDM
Helen Ougham, IGER
BBSRC

The Need for Logic to Represent Scientific Knowledge
Logic is the best understood way to represent knowledge.
Traditional statistics, machine learning, and data mining are based on propositional logic.
For some problems we require a richer description language, i.e. first-order predicate calculus.
Using logic programming (predicate calculus) we can incorporate deduction, abduction, and induction.

Inductive Logic Programming
Inductive Logic Programming (ILP) learns with logic programs (first-order predicate calculus), which are used to describe examples, theories, and background knowledge.
For certain types of problem ILP is a powerful data analysis technique, giving more accurate and more comprehensible results than conventional methods.
It has been successfully applied to a number of biological/chemical problems.

ILP for Science
The key advantage of ILP for scientific applications is that it allows the application of compact relational representations that are natural for scientists to use. This allows rules that are understandable in the domain to be formed automatically.
This advantage comes at a computational cost.
However, non-technical reasons are probably the greatest barrier to the adoption of ILP. For example, it is very difficult to explain the benefits of ILP to domain experts.

Prediction of Lethality
Instead of using microarray data to predict the functional class of a gene, we have been using the same approach to predict whether a gene knock-out will be lethal (grown in a rich medium).
Example Rule:
If false: the function of the ORF is cell cycle
and true: the function of the ORF is rRNA transcription
and in the micro-array experiment (cell cycle) the ORF expression is > -0.79
then the knockout is lethal.
Test accuracy 82% (Default 21%).

Summary Results
Using voting (2 or more rules agree on a prediction)
- Level 2: 128 ORFs predicted - 87.5% accuracy
- Level 3: 23 ORFs predicted - 91.3% accuracy
All predictions
- Level 2: 335 ORFs predicted - 64.5% accuracy
- Level 3: 204 ORFs predicted - 44.6% accuracy
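The voting scheme in the summary keeps a prediction only when 2 or more rules agree on it, trading coverage for accuracy. The sketch below is one possible reading of that scheme, assuming predictions arrive as hypothetical (rule id, ORF, class) triples; the rule ids, ORF names and classes are invented, and the talk does not specify how ties or duplicate rules were handled.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Set, Tuple

Prediction = Tuple[str, str]  # (orf, predicted_class)


def vote(rule_predictions: Iterable[Tuple[str, str, str]],
         min_votes: int = 2) -> Set[Prediction]:
    """Keep (orf, class) pairs predicted by at least min_votes distinct rules."""
    distinct = set(rule_predictions)  # a rule counts once per (orf, class)
    votes = Counter((orf, cls) for _rule, orf, cls in distinct)
    return {pair for pair, n in votes.items() if n >= min_votes}


def accuracy(predictions: Set[Prediction],
             confirmed: Dict[str, Set[str]]) -> Optional[float]:
    """Fraction of predictions on ORFs of confirmed function that are correct."""
    scored = [(orf, cls) for orf, cls in predictions if orf in confirmed]
    if not scored:
        return None
    return sum(cls in confirmed[orf] for orf, cls in scored) / len(scored)


# Invented predictions: (rule id, ORF, predicted class).
rule_predictions = [
    ("rule_1", "orf_a", "transport"),
    ("rule_2", "orf_a", "transport"),
    ("rule_3", "orf_b", "energy"),
    ("rule_3", "orf_c", "transport"),
    ("rule_4", "orf_c", "transport"),
]
confirmed = {"orf_a": {"transport"}, "orf_b": {"metabolism"}, "orf_c": {"energy"}}

voted = vote(rule_predictions)
all_predictions = {(orf, cls) for _rule, orf, cls in rule_predictions}
print(sorted(voted))                         # [('orf_a', 'transport'), ('orf_c', 'transport')]
print(accuracy(voted, confirmed))            # 0.5
print(accuracy(all_predictions, confirmed))  # 0.3333333333333333
```

Raising `min_votes` shrinks the set of ORFs that receive a prediction, which matches the pattern in the summary: fewer ORFs predicted under voting, but at higher accuracy.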