Mechanism of obtaining unsupervised knowledge to enrich CliDaPa approach

S. González1*, J. Veiga1, V. Robles1, J.M. Peña1 and F. Famili2
(1) Department of Computer Architecture and Technology, Universidad Politécnica de Madrid, Spain
(2) NRC Institute for Information Technology, Ottawa, Canada
e-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

Keywords: CliDaPa, unsupervised, clinical, DNA microarrays, data analysis

Motivation and Aim: In [9], the CliDaPa algorithm, applied to two or more information sources (e.g. clinical and gene expression data), improved disease classification results compared with traditional uses of the data. However, of the data used, only the gene expression data is not easily understandable for expert biologists. If new knowledge can be obtained from the gene information, it could be used as a new information source for CliDaPa and would probably improve its results.

Methods and Algorithms: Sample-based clustering of the gene expression data can be obtained with an unsupervised method, Quality Threshold [10]. This requires defining several parameters: the distance measure, the threshold and the minimum number of elements in a cluster. Using Euclidean, Manhattan, Pearson correlation and Biweight correlation [11] as distance measures, and 10 different threshold values within the range [mean - 2*deviation, mean + 2*deviation], 40 new clusterings can be obtained (4 distance measures x 10 thresholds). These clusterings are added to the input data as new clinical variables. Thus, when the CliDaPa algorithm is executed, the data can be divided using any of these new variables, but only if doing so improves the classification.

Results: To validate this proposal, several data sets with clinical and gene expression data (the Van't Veer, Van de Vijver and Brain Cancer datasets) have been used.
Several experiments have been carried out using an external MxN-fold cross-validation. The results show a 10% improvement in classification compared with regular CliDaPa executions.

Conclusions: This new approach demonstrates that new unsupervised knowledge can improve and enrich a supervised classification. Applied to the CliDaPa approach, it fulfils the proposed objectives.

Availability: CliDaPa and its improvements are available to the research community. For further information, please contact the authors.

Acknowledgements: The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the Centro de Supercomputación y Visualización de Madrid (CeSViMa) and the Spanish Supercomputing Network.

References:
1. J. Brenton (2005) Molecular classification and molecular forecasting of breast cancer: ready for clinical application? J. Clin. Oncol., 23:7350–7360.
2. P. Larrañaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armañanzas, G. Santafé, A. Pérez and V. Robles (2006) Machine learning in bioinformatics. Briefings in Bioinformatics.
3. Scott L. Pomeroy, Pablo Tamayo, Michelle Gaasenbeek, Lisa M. Sturla, Michael Angelo, Margaret E. McLaughlin, John Y. H. Kim, Liliana C. Goumnerova, Peter M. Black, Ching Lau, Jeffrey C. Allen, David Zagzag, James M. Olson, Tom Curran, Cynthia Wetmore, Jaclyn A. Biegel, Tomaso Poggio, Shayan Mukherjee, Ryan Rifkin, Andrea Califano, Gustavo Stolovitzky, David N. Louis, Jill P. Mesirov, Eric S. Lander and Todd R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression.
4. M. J. van de Vijver, Y. D. He, L. J. van't Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend and R. Bernards (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med, 347(25):1999–2009, December.
5. L. J. van't Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards and S. H. Friend (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, January.
6. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield and E.S. Lander (1999) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring.
7. S. Paoli, G. Jurman, D. Albanese, S. Merler and C. Furlanello. Semisupervised Profiling of Gene Expressions and Clinical Data. ITC-irst, Trento, Italy.
8. Nathalie L.M.M. Pochet, Frizo A.L. Janssens, Frank De Smet, Kathleen Marchal, Ignace B. Vergote, Johan A.K. Suykens and Bart L.R. De Moor. M@CBETH: Optimizing Clinical Microarray Classification. Department of Electrical Engineering ESAT-SCD, Leuven-Heverlee, Belgium.
9. S. González, L. Guerra, V. Robles, J. M. Peña and F. Famili (2009) CliDaPa: A new approach to combining clinical data with DNA microarrays. IDA Journal.
10. Laurie J. Heyer, Semyon Kruglyak and Shibu Yooseph (1999) Exploring expression data: Identification and analysis of coexpressed genes. Genome Research, 9(11):1106–1115, November.
11. Johanna Hardin, Aya Mitani, Leanne Hicks and Brian VanKoten (2007) A robust measure of correlation between two genes on a microarray. BMC Bioinformatics, 8(1):220+, June.
12. Pavel Berkhin (2002) Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA.
13. Daxin Jiang, Chun Tang and Aidong Zhang (2004) Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16:1370–1386.
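
Illustrative sketch: The clustering step described under Methods and Algorithms can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names (`threshold_grid`, `qt_cluster`, `cluster_features`) are hypothetical. The abstract does not say what "mean" and "deviation" refer to; the sketch assumes they are the mean and standard deviation of the pairwise sample distances. Biweight correlation [11] is omitted for brevity, so the sketch yields 30 label vectors rather than 40. Pearson distance is taken as 1 minus the correlation and assumes no sample profile is constant.

```python
import math
import statistics
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_distance(a, b):
    # Assumption: distance = 1 - Pearson correlation; profiles are non-constant.
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(va * vb)


def threshold_grid(samples, dist, steps=10):
    # Assumption: mean and deviation are taken over all pairwise distances.
    d = [dist(a, b) for a, b in combinations(samples, 2)]
    mean, dev = statistics.fmean(d), statistics.pstdev(d)
    lo, hi = mean - 2 * dev, mean + 2 * dev
    return [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]


def qt_cluster(samples, dist, threshold, min_size=2):
    """Quality Threshold clustering; returns one label per sample (-1 = unclustered)."""
    n = len(samples)
    d = [[dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    labels = [-1] * n
    remaining = set(range(n))
    next_label = 0
    while remaining:
        best = []
        for seed in sorted(remaining):
            cluster = [seed]
            # reach[j] = largest distance from candidate j to any cluster member,
            # i.e. the diameter contribution if j were added.
            reach = {j: d[seed][j] for j in remaining if j != seed}
            while reach:
                j = min(reach, key=lambda k: (reach[k], k))
                if reach[j] > threshold:
                    break  # adding any further sample would exceed the diameter
                cluster.append(j)
                del reach[j]
                for k in reach:
                    reach[k] = max(reach[k], d[j][k])
            if len(cluster) > len(best):
                best = cluster
        if len(best) < min_size:
            break  # the remaining samples stay unclustered
        for i in best:
            labels[i] = next_label
        remaining -= set(best)
        next_label += 1
    return labels


def cluster_features(samples, min_size=2):
    # One label vector per (distance measure, threshold) pair; each vector
    # could then be appended to the clinical data as a new categorical variable.
    measures = {"euclidean": euclidean, "manhattan": manhattan,
                "pearson": pearson_distance}
    features = {}
    for name, dist in measures.items():
        for i, t in enumerate(threshold_grid(samples, dist)):
            features[f"qt_{name}_{i}"] = qt_cluster(samples, dist, t, min_size)
    return features
```

Here `samples` is a list of per-sample expression profiles (one list of floats per patient). Thresholds at the low end of the grid can be negative, in which case every sample is left unclustered; whether such a variable is kept is decided later, since CliDaPa only splits on a new variable when that improves the classification.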