Mechanism of obtaining unsupervised knowledge to enrich CliDaPa approach
S. González (1)*, J. Veiga (1), V. Robles (1), J.M. Peña (1) and F. Famili (2)
(1) Department of Computer Architecture and Technology, Universidad Politécnica de Madrid, Spain
(2) NRC Institute for Information Technology, Ottawa, Canada
e-mail:[email protected], [email protected], [email protected], [email protected], [email protected]
Keywords: CliDaPa, unsupervised, clinical, DNA microarrays, data analysis
Motivation and Aim:
In [9], the CliDaPa algorithm, applied to two or more information sources (e.g. clinical and gene
expression data), improved disease classification results compared with the traditional uses of the
same data. Of the data involved, however, only the gene expression data is not easily interpretable
by expert biologists. If new knowledge can be extracted from the gene information, it could serve as
an additional information source for CliDaPa and might improve its results further.
Methods and Algorithms:
Sample-based clustering of the gene expression data can be obtained with an unsupervised method,
Quality Threshold (QT) clustering [10]. This requires three parameters to be defined: the distance
measure, the threshold and the minimum number of elements in a cluster. Using Euclidean, Manhattan,
Pearson correlation and Biweight correlation [11] as distance measures, with 10 different threshold
values for each in the range [mean – 2*deviation, mean + 2*deviation], 40 new clusterings
(4 measures x 10 thresholds) can be obtained. Each clustering is added to the input data as a new
clinical variable. When CliDaPa is then executed, it can divide the data using any of these new
variables, but does so only if the classification improves.
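The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the mean and deviation are assumed to be taken over the pairwise distances between samples, the 10 thresholds are assumed to be equally spaced, Pearson correlation r is turned into a distance as 1 - r, and the Biweight correlation of [11] is omitted for brevity. The clustering step follows the usual greedy description of Quality Threshold clustering: grow a candidate cluster around every remaining sample while its diameter stays within the threshold, keep the largest candidate, and repeat.

```python
import math
import statistics
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_distance(a, b):
    """1 - Pearson correlation (assumption); 1.0 if either profile is constant."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 1.0
    return 1.0 - cov / math.sqrt(va * vb)


def threshold_grid(samples, dist, n=10):
    """n equally spaced thresholds in [mean - 2*dev, mean + 2*dev] of pairwise distances."""
    d = [dist(a, b) for a, b in combinations(samples, 2)]
    mean, dev = statistics.fmean(d), statistics.pstdev(d)
    lo, hi = mean - 2 * dev, mean + 2 * dev
    return [lo + i * (hi - lo) / (n - 1) for i in range(n)]


def qt_cluster(samples, dist, threshold, min_size=2):
    """Return one cluster label per sample; -1 marks samples left unclustered."""
    n = len(samples)
    d = [[dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    labels = [-1] * n
    remaining = set(range(n))
    next_label = 0
    while remaining:
        best = []
        for seed in sorted(remaining):
            cluster = [seed]
            candidates = remaining - {seed}
            while candidates:
                # Pick the sample whose addition yields the smallest diameter.
                c = min(sorted(candidates),
                        key=lambda k: max(d[k][m] for m in cluster))
                if max(d[c][m] for m in cluster) > threshold:
                    break  # every other candidate would exceed the threshold too
                cluster.append(c)
                candidates.remove(c)
            if len(cluster) > len(best):
                best = cluster
        if len(best) < min_size:
            break  # largest candidate is too small: stop clustering
        for i in best:
            labels[i] = next_label
        remaining -= set(best)
        next_label += 1
    return labels


def cluster_features(samples, measures, n_thresholds=10, min_size=2):
    """One label vector per (measure, threshold) pair, usable as extra variables."""
    features = {}
    for name, dist in measures.items():
        for t, thr in enumerate(threshold_grid(samples, dist, n_thresholds)):
            features[(name, t)] = qt_cluster(samples, dist, thr, min_size)
    return features
```

Each label vector returned by `cluster_features` has one entry per sample, so it can be appended to the clinical table as a categorical column. With four distance measures and 10 thresholds each, this would give the 40 new variables described above.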
Results:
To validate this proposal, several data sets containing both clinical and gene expression data (the
van 't Veer, van de Vijver and Brain Cancer datasets) were used. Several experiments were carried
out using an external MxN-fold cross-validation. The results show a 10% improvement in
classification compared with regular CliDaPa executions.
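As an illustration of the validation scheme, the sketch below reads "MxN fold" as M repetitions of N-fold cross-validation, and "external" as meaning that every data-dependent step (clustering, partitioning, classifier training) sees only the training indices of each split. Both readings are assumptions, as are the function names and the default values of M and N; the abstract does not give these details.

```python
import random


def repeated_kfold(n_samples, m_repeats, n_folds, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for M repetitions of N-fold CV."""
    rng = random.Random(seed)
    for rep in range(m_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh random partition for every repetition
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test in enumerate(folds):
            train = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, sorted(train), sorted(test)


def external_cv(n_samples, fit, score, m_repeats=5, n_folds=10, seed=0):
    """Average score over all M*N splits; `fit` receives training indices only."""
    scores = []
    for _, _, train, test in repeated_kfold(n_samples, m_repeats, n_folds, seed):
        model = fit(train)  # all data-dependent steps belong inside fit
        scores.append(score(model, test))
    return sum(scores) / len(scores)
```

Here `fit` and `score` are placeholders supplied by the caller: `fit(train)` would build the whole pipeline from the training samples, and `score(model, test)` would return its accuracy on the held-out samples.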
Conclusions:
This new approach demonstrates that knowledge obtained in an unsupervised way can improve and
enrich a supervised classification. Applied to the CliDaPa approach, it fulfils the proposed
objectives.
Availability:
CliDaPa and its improvements are available to the research community. For further information,
please contact the authors.
Acknowledgements:
The authors thankfully acknowledge the computer resources, technical expertise and assistance
provided by the Centro de Supercomputación y Visualización de Madrid (CeSViMa) and the
Spanish Supercomputing Network.
References:
1. J. Brenton. (2005) Molecular classification and molecular forecasting of breast cancer: ready for clinical application? J.
Clin. Oncol., 23:7350–7360
2. P. Larrañaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armañanzas, G. Santafé, A. Pérez
and V. Robles (2006) Machine learning in bioinformatics. Briefings in Bioinformatics.
3. Scott L. Pomeroy, Pablo Tamayo, Michelle Gaasenbeek, Lisa M. Sturla, Michael Angelo, Margaret E.
McLaughlin, John Y. H. Kim, Liliana C. Goumnerova, Peter M. Black, Ching Lau, Jeffrey C. Allen, David Zagzag,
James M. Olson, Tom Curran, Cynthia Wetmore, Jaclyn A. Biegel, Tomaso Poggio, Shayan Mukherjee, Ryan Rifkin,
Andrea Califano, Gustavo Stolovitzky, David N. Louis, Jill P. Mesirov, Eric S. Lander & Todd R. Golub. Prediction
of central nervous system embryonal tumour outcome based on gene expression.
4. M. J. van de Vijver, Y. D. He, L. J. van’t Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C.
Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S.
Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards. A gene-expression signature as a predictor of survival in breast
cancer. N Engl J Med, 347(25):1999–2009, December 2002.
5. L. J. van 't Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J.
Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend
(2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, January.
6. T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing,
M.A. Caligiuri, C.D. Bloomfield, E.S. Lander (1999) Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Monitoring.
7. S. Paoli, G. Jurman, D. Albanese, S. Merler, and C. Furlanello, Semisupervised Profiling of Gene Expressions and
Clinical Data, ITC-irst - Trento, Italy
8. Nathalie L.M.M. Pochet, Frizo A.L. Janssens, Frank De Smet, Kathleen Marchal, Ignace B. Vergote, Johan A.K.
Suykens and Bart L.R. De Moor, M@CBETH: Optimizing Clinical Microarray Classification, Department of Electrical
Engineering ESAT-SCD, Leuven-Heverlee, Belgium
9. S. González, L. Guerra, V. Robles, J. M. Peña and F. Famili. CliDaPa: A new approach to combining clinical data with
DNA microarrays. IDA Journal 2009.
10. Laurie J. Heyer, Semyon Kruglyak, and Shibu Yooseph. Exploring expression data: Identification and analysis of
coexpressed genes. Genome Research, 9(11):1106–1115, November 1999.
11. Johanna Hardin, Aya Mitani, Leanne Hicks, and Brian VanKoten. A robust measure of correlation between two genes
on a microarray. BMC Bioinformatics, 8(1):220+, June 2007.
12. Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
13. Daxin Jiang, Chun Tang, and Aidong Zhang. Cluster analysis for gene expression data: A survey. IEEE Transactions
on Knowledge and Data Engineering, 16:1370–1386, 2004.