BioinfoGRID Symposium 2007
Mathematical methods for the analysis of the Barcode Of Life
P. Bertolazzi, G. Felici
Istituto di Analisi dei Sistemi e Informatica del Consiglio Nazionale delle Ricerche
Optimization Laboratory for Data Mining

DNA Barcoding 1/4
• DNA barcoding is a new technique that uses a short DNA sequence, taken from a standardized and agreed-upon position in the genome, as a molecular diagnostic for species-level identification.
• The sequence chosen as barcode is a small portion of mitochondrial DNA (mt-DNA) that differs by several percent even between closely related species, and carries enough information to identify the species of an individual.
• It is easy to isolate and analyze.
• Moreover, it summarizes many properties of the entire mt-DNA sequence.

DNA Barcoding 2/4
[Figure: a typical animal cell. Within the cytoplasm, the major organelles and cellular structures include: (1) nucleolus (2) nucleus (3) ribosome (4) vesicle (5) rough endoplasmic reticulum (6) Golgi apparatus (7) cytoskeleton (8) smooth endoplasmic reticulum (9) mitochondria (10) vacuole (11) cytosol (12) lysosome (13) centriole.]

DNA Barcoding 3/4
• The first studies on barcoding (2003) are due to Hebert (see [1] for the latest results and a complete bibliography).
• Two mt-DNA subsequences (genes) have been proposed as barcodes:
  – Cytochrome c Oxidase I (COI)
  – Cytochrome b
• Since 2003, COI has been used by Hebert to study fishes, birds, and other species.
• Hebert employs the Neighbor Joining (NJ) method [2], originally proposed for building phylogenetic trees, and identifies each species with a distinct, non-overlapping cluster of sequences in the NJ tree.
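As a minimal sketch (toy sequences, not real COI data, and no alignment handling), the pairwise p-distance (the fraction of aligned sites at which two sequences differ) is the kind of dissimilarity that distance-based methods such as Neighbor Joining build their trees from:

```python
# Sketch: p-distance between two aligned barcode fragments.
# The sequences below are hypothetical, chosen only for illustration.

def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)

# Two fragments differing at 1 of 10 sites:
print(p_distance("ACGTACGTAC", "ACGTACGTAT"))  # 0.1
```

A matrix of such distances over all specimen pairs is the input that NJ-style clustering works on.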
DNA Barcoding 4/4
• Recent studies [1] show that even fragments of the COI sequence have the same expressive power as the entire sequence.
• The Consortium for the Barcode of Life (CBOL) is an international initiative devoted to developing DNA barcoding as a global standard for the identification of biological species. http://www.barcoding.si.edu/
• Data Analysis Working Group (CBOL and DIMACS, Rutgers): http://dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges2007/ http://dimacs.rutgers.edu/Workshops/DNAInitiative/

Research challenges 1/2
• Specimen identification versus species discovery: the knowledge about species is not always complete. Is a species similar to another or not?
• Optimizing sample size: barcodes are not easy to measure, and large samples are very expensive.
• Using character-based barcodes: an alternative to comparing specimens in terms of overall percent sequence similarity.
• Shrinking the barcode.

Research challenges 2/2
• Shrinking the barcode: we want to identify the portions of the barcode that are relevant for each species; this would make it easier to identify new data on species and to study it.

A barcode fragment for 2 species
Species 1: CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC
Species 2: CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC
• 45 sites
• The two sequences differ on only 8 sites
• In this case a single site is sufficient to distinguish the 2 species

Main goal
• Given samples from different species, the objective is to identify those combinations of mutated nucleotides that have determined the differences among species along the evolutionary path.

Our work 1/2
– We have addressed these challenges using supervised learning and special methods that we have produced for the classification of large-scale logic data.
– Such methods have already proved successful in other bio-computational problems (tag SNP selection, microarray data analysis, genetic diagnostics) [3]:
• Feature selection methods based on Integer Programming: aim at finding a subset of bases that
allows us to distinguish among species.
• Logic classification methods based on logic programming: aim at finding rules or formulas that can distinguish between individuals of different species.

Our Work 2/2
• Feature selection is used to select a limited number of positions where the values of the bases differ consistently from species to species;
• Logic classification methods use the selected positions to find formulas with high semantic value that can provide a deep understanding of the analyzed data. Logic classification is computationally expensive.
[Pipeline: many features → feature selection → few features → logic classification → compact logic formulas]

Supervised Learning: the standard setting
• Given:
  – a finite number of objects described by a finite number of measures X = (x1, x2, …, xn);
  – an interesting characteristic of these objects, Y (the TARGET).
• Determine the relation between the target and the measures: Y = F(X). It is only required that, for each vector of measures X, the relation produces an output Y.
  – Classification: Y ∈ {0, 1, 2, …}
  – Estimation: Y ∈ ℝ
• Very important:
  – the form of the relation is decided up front;
  – the parameters of the relation are determined by the data;
  – the results of the estimation need to be validated.

Supervised Learning: training and testing
[Diagram: Data → Training Set → choose model → estimate model → Testing Set → performance ok?]
• If the data contains good information and the choice of the model is correct, the model has a good fit on the training data.
• A good fit to the training data may provide information and knowledge.
• The model should also adapt well to the test data.
• Only a good fit to the test data strengthens the knowledge extracted and validates the forecasting or classification function of the model.

The Special Case of Logic Data 1/2
• Frequently, objects are (or can be) described only by logic attributes (True/False).
• The classification function is expressed in terms of:
  – logic variables p, q, r, … with True/False values;
  – conjunction (p ∧ q), disjunction (p ∨ q), negation (¬p).
[Table: objects S1–S4 described by logic variables (properties of the object), with values True/False/unknown ('?') and class labels A or I; the same data encoded numerically with 1 = True, -1 = False, 0 = unknown.]
• Example rule: "If the object shows both the presence of property 1 and the absence of property 3, then it is class A, else it is class I."

The Special Case of Logic Data 2/2
[Table: a set of objects in two classes (red/blue), each described by six binary variables X1–X6.]
• IF [ ¬X1 & ¬X2 ] OR [ ¬X2 & ¬X3 ] OR [ ¬X1 & ¬X3 ]: true for red, false for blue.
• IF [ X2 & X6 ] OR [ X1 & X3 ] OR [ X2 & X3 ] OR [ X1 & X2 ]: true for blue, false for red.

Mining Logic Data
• The Lsquare Logic Miner [4,5,6]:
  – builds logic separations in Disjunctive Normal Form (DNF);
  – iteratively identifies the clauses of the DNF that separate the largest part of the objects in one
class from all the objects of the other class.
• Clause identification is based on the solution of a Minimum Cost Satisfiability problem (MINSAT), which is computationally hard.
• Encoding: each logic variable i is represented by a pair of boolean variables (pi, qi): pi True and qi False puts the literal xi in the clause; pi False and qi True puts the literal ¬xi; pi and qi both False leaves variable i out of the clause.
[Worked example omitted: a small MINSAT instance built from objects S1–S4, showing the clause constraints for each object and a satisfying assignment of the (pi, qi) variables.]

Mining Logic Data
• Two MINSAT problems are solved at each iteration:
  1. select the largest separable subset of objects;
  2. identify the clause with the desired support.
[Both are stated as minimum-cost satisfiability problems over the (pi, qi) variables, with one constraint per training object.]
• Lsquare finds 4 DNF formulas:
  1: A from B, min support
  2: A from B, max support
  3: B from A, min support
  4: B from A, max support
  ALL: majority vote, using "undecided" when the formulas disagree.
• The behavior of the formulas strongly interacts with the quality of the data…

Feature Selection
• Most methods for FS are based on a greedy construction of the feature set, and do not take into proper account the interactions among the selected features (correlation, collinearity).
• An interesting method adopts Integer Programming (Set Covering) to select the smallest set of features that makes it possible to differentiate logic records belonging to different classes.

The Set Covering Approach for FS (Rutgers - LAD approach)
• A binary variable xj is associated with each feature j.
• A linear constraint is associated with each pair of objects belonging to different classes.
• The set covering model requires that each pair of objects belonging to different classes is differentiated by at least one of the selected features:
  min Σj xj
  s.t.
  Σj a(i,k),j xj ≥ 1   for all (i,k): i ∈ A, k ∈ I
  xj ∈ {0,1}
  where a(i,k),j = 1 iff objects i and k belong to different classes and differ on feature j, and xj = 1 iff feature j is selected.

Feature Selection: a simplified version
• To overcome the intractable dimensions of the quadratic model, we propose a simplified version of the set covering model (Linear SC), where the number of constraints is equal to the number of samples.
• Notation: fi = feature i; PA(i) = proportion of elements with fi = 1 in class A; PB(i) = proportion of elements with fi = 1 in class B; xi = 1 iff feature i is chosen in the solution; assume PA(i) > PB(i).
• When the coverage β is given:
  min Σi xi
  s.t. Σi dij xi ≥ β for each row j
       xi ∈ {0,1}
  with dij = 1 if row j is in class A and fi = 1; 0 if row j is in class A and fi = 0; 1 if row j is in class B and fi = 0; 0 if row j is in class B and fi = 1.
• The number of rows is linear in the number of examples; redundancy can be controlled; heuristics may still be needed for a large number of features…
• Alternatively, coverage is maximized given a bound F on the number of features:
  max β
  s.t. Σi dij xi ≥ β for each row j
       Σi xi ≤ F
       xi ∈ {0,1}

Application to Barcode Data
• Data set from the 2006 conference.
• 1623 samples belonging to 150 different species.
• Each sample is described by 690 nucleotides (columns).
• We search for a compact rule for each one of the 150 classes.
• For each species k, we solve a 2-class learning problem:
  – class A: all samples in class k;
  – class B: the samples of all classes different from k.
• We use Linear set covering for feature selection and logic mining to determine the formulas on a training subset (80–90%) of the available data, and then test their classification capabilities on the remaining data.
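As an illustrative sketch of the feature-selection idea (not the authors' exact model, which is solved to optimality as an Integer Program): a standard greedy set-cover heuristic that keeps picking features until every pair of samples from different classes differs on at least one selected feature. The function name and the toy binary data are hypothetical.

```python
# Greedy stand-in for set-covering feature selection: cover every
# cross-class pair (i, k) with at least one feature on which the two
# samples differ. Toy 0/1 data, not barcode data.
from itertools import product

def greedy_feature_cover(class_a, class_b):
    n_feat = len(class_a[0])
    # pairs (i, k) not yet differentiated by any chosen feature
    uncovered = set(product(range(len(class_a)), range(len(class_b))))
    chosen = []
    while uncovered:
        # pick the feature that differentiates the most uncovered pairs
        best = max(range(n_feat),
                   key=lambda j: sum(class_a[i][j] != class_b[k][j]
                                     for i, k in uncovered))
        newly = {(i, k) for i, k in uncovered
                 if class_a[i][best] != class_b[k][best]}
        if not newly:  # remaining pairs agree on every feature
            break
        chosen.append(best)
        uncovered -= newly
    return chosen

A = [(1, 0, 1, 1), (1, 1, 1, 0)]
B = [(0, 0, 1, 1), (0, 1, 0, 1)]
print(greedy_feature_cover(A, B))  # [0]: feature 0 alone separates A from B
```

The greedy rule has no optimality guarantee on the cover size, which is precisely why the slides solve the set-covering model exactly.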
• Training and testing samples are drawn at random, maintaining the same proportion in each class.

Application to Barcode Data
[Pipeline: DATA SET → TRAINING SET → FEATURE SELECTION → FORMULA EXTRACTION → TEST SET]
• The Integer Programming model associated with the Linear Set Covering model is solved optimally with the commercial solver ILOG CPLEX, with 10, 20 and 30 as values for the bound F on the number of features.
• LSQUARE is used to separate each species from the other 149, and a compact formula explaining each species is obtained.
• The formulas are used to predict the species of each element in the test set.

Application to Barcode Data: LSC construction
• PAjk, PCjk, PGjk, PTjk = proportion of samples in class k with nucleotide A, C, G, T (respectively) in column j.
• PNAjk, PNCjk, PNGjk, PNTjk = proportion of samples in classes ≠ k with nucleotide A, C, G, T (respectively) in column j.
• aij = 1 iff sample i is in class k and:
  – the nucleotide of i in column j is A and PAjk > 2 PNAjk, or
  – the nucleotide of i in column j is C and PCjk > 2 PNCjk, or
  – the nucleotide of i in column j is G and PGjk > 2 PNGjk, or
  – the nucleotide of i in column j is T and PTjk > 2 PNTjk.
• Model:
  max Σi Σj aij xj
  s.t. Σj xj ≤ F
       xj ∈ {0,1}
• Select the columns with the largest coverage; use these columns to formulate a separation problem for each class with Lsquare (1 vs. all); obtain a logic formula for each class.

An example
[Worked example: samples from 3 classes over 4 barcode columns, with the in-class proportions (PA, PC, PG, PT) and out-of-class proportions (PNA, PNC, PNG, PNT) computed for each column and each class.]
• With coverage β = 1: x4 = 1, x1 = x2 = x3 = 0 is separable.
• With coverage β = 2: x3 = 1, x1 = x2 = x4 = 0 is not separable; x3 = x4 = 1, x1 = x2 = 0 is separable.
• Resulting rules:
  IF (X4 = T) THEN CLASS 1
  IF (X4 = A) THEN CLASS 2
  IF (X4 = C) THEN CLASS 3

Results
• For each row of the table we solve 1 set covering problem and 150 logic classification problems; the feature bound F takes values in {10, 20, 30}, the test percentage in {10, 20}, with 3 random repetitions for each setting.

F | optimal β | test % | training error | testing error
10 | 2 | 10 | 11.14% | 11.38%
10 | 2 | 10 | 5.62% | 8.98%
10 | 3 | 10 | 5.99% | 7.19%
10 | 3 | 20 | 9.21% | 10.17%
10 | 3 | 20 | 7.11% | 9.75%
10 | 2 | 20 | 7.81% | 8.05%
average (F = 10): 7.81% | 9.25%
20 | 6 | 10 | 2.06% | 2.40%
20 | 6 | 10 | 0.84% | 1.20%
20 | 6 | 10 | 0.28% | 0.60%
20 | 6 | 20 | 1.90% | 2.12%
20 | 6 | 20 | 0.30% | 1.27%
20 | 6 | 20 | 0.30% | 1.27%
average (F = 20): 0.95% | 1.48%
30 | 9 | 10 | 0.37% | 0.60%
30 | 9 | 10 | 0.28% | 0.60%
30 | 9 | 10 | 0.37% | 0.60%
30 | 9 | 20 | 0.30% | 1.69%
30 | 9 | 20 | 0.30% | 0.85%
average (F = 30): 0.33% | 0.87%
overall: 3.03% | 3.87%

Results
• Site 580 appears 54 times in the formulas that discriminate each class from the rest (approx.
1/3 of the 150 formulas).
[Chart: average error on the test set for F = 10, 20, 30.]
[Chart: mode of the optimal coverage β for F = 10, 20, 30.]
[Table: most frequently selected sites and their occurrences in the formulas, including 580 (54), 490 (49), 346 (41), 469 (41), 544 (41), 637 (41), 331 (39), 445 (34), 154 (30), 85 (29), 307 (29), 379 (29), 31 (25), 211 (25), 28 (22), and others.]

Results

SPECIES | DIM | CLAUSE(S)
0 | 1 | 274 A, 499 T, 580 C
1 | 1 | 19 T, 172 T
2 | 1 | 340 G, 343 A, 445 C
3 | 1 | 445 T, 499 T, 580 G
4 | 1 | 172 C, 445 T, 493 G, 499 C, 580 A
5 | 1 | 58 G, 430 T
6 | 1 | 268 C, 289 A, 334 C, 430 T
7 | 1 | 136 A, 277 C, 445 T
8 | 2 | 58 A, 121 T, 172 C * 277 T, 499 G
9 | 1 | 163 T, 274 A, 334 G
10 | 1 | 19 T, 331 C, 652 T
11 | 1 | 4 C, 10 T, 274 C, 289 A, 445 A
12 | 1 | 340 G, 445 G
13 | 1 | 277 A, 340 A, 430 T, 445 C
14 | 1 | 121 C, 331 T
15 | 1 | 58 A, 283 C, 331 T

Results
• IF the base in position 19 is T AND the base in position 172 is T, THEN the species is … 1
• IF the base in position 340 is G AND the base in position 343 is A AND the base in position 445 is C, THEN the species is … 2
• IF (the base in position 58 is A AND the base in position 121 is T AND the base in position 172 is C) OR (the base in position 277 is T AND the base in position 499 is G), THEN the species is … 8

Related work
• Haplotype inference by parsimony: a new, very efficient heuristic.
• Tag SNP selection and the SNP reconstruction problem.
• Phylogenetic trees in polyploid organisms.

Conclusions and Future work
• The logic technique is very powerful for identifying small non-contiguous subsequences of the barcode.
• We are testing the technique on a very large data set from Lepidoptera.
• We will compare our technique with the NJ method [2].
• CBOL and the DAWG have asked us to implement our technique as a web service.
• We are designing a software platform that implements algorithms for all the above problems.

References
[1] M. Hajibabaei, G. A. C. Singer, E. L. Clare, P. D. N.
Hebert, Design and applicability of DNA arrays and DNA barcodes in biodiversity monitoring, BMC Biology, 2007.
[2] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol Biol Evol, 1987.
[3] P. Bertolazzi, G. Felici, P. Festa, G. Lancia, Logic classification and feature selection for biomedical data, Computers & Mathematics with Applications, online version (2007).
[4] G. Felici, K. Truemper, A MINSAT approach for learning in logic domains, INFORMS Journal on Computing, 2002.
[5] K. Truemper, Design of Logic-Based Intelligent Systems, Wiley-Interscience, 2004.
[6] G. Felici, K. Truemper, The Lsquare system for mining logic data, Encyclopedia of Data Warehousing and Mining, 2005.
[7] P. Bertolazzi, G. Felici, Species classification with optimized logic formulas, poster, EMBO Conference, Rome, May 2007.
[8] P. Bertolazzi, G. Felici, Species classification with optimized logic formulas, invited talk, Second BOL Conference, Taipei, Sept. 2007.