Identification of Cancer-Causing Mutations in the Human Genome with Machine Learning Techniques

U, Man Chon (Kevin)
Computer Science Department
The University of Georgia
[email protected]

Introduction

Cancer is a leading cause of death worldwide, and the total number of cases globally is increasing. The number of global cancer deaths is projected to increase by 45% from 2007 to 2030 (from 7.9 million to 11.5 million deaths). In most developed countries, cancer is the second largest cause of death after cardiovascular disease. Research to improve our understanding of the causes of cancer and its most promising therapies is therefore urgently needed.

Discussion

Our experimental results demonstrate that machine learning techniques can identify cancer-causing mutations in the human genome with very high accuracy. The small variance in accuracy across the different machine learning algorithms suggests that our new features play a significant role in the identification process. Furthermore, when we limited our experiments to the Kinase domain, the classification accuracy reached 90.1757%, and when we also used the experimentally confirmed list of driver (cancer-causing) mutations, we identified the mutations with 98.549% accuracy, without applying any attribute selection or instance selection methods.

Figure 1. Single Nucleotide Polymorphisms

Background

• Single nucleotide polymorphisms (SNPs): DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered, as Figure 1 illustrates.
• Mutations: Driver mutations are responsible for oncogenicity. Passenger mutations are harmless.
• Hypothesis: Subsets of the non-synonymous SNPs (nsSNPs) will help identify the multiple genes associated with complex ailments such as cancer.
• Motivation: Finding those nsSNPs is extremely expensive and time-consuming.
• Solution: The aim of this research is to use machine learning techniques to identify probable cancer-causing nsSNPs.

Ultimate Goal

To identify suspicious mutations that we can assert, with a high degree of certainty, to be driver mutations, and to build a sophisticated model for this process.

Contributions

• Applied different machine learning techniques to identify cancer-causing mutations.
• New features are introduced.
• Provide evidence to biologists for inventing new therapies for cancer treatment.

Figure 2. Visualization of Classification Results (accuracy, 0% to 100%, for Exp. 1: not limited to Kinase domain; Exp. 2: limited to Kinase domain; Exp. 3: limited to Kinase domain and experimentally confirmed drivers list)

Table I. Classification Results (accuracy, %)

Algorithm          Exp. 1     Exp. 2     Exp. 3
J48 (Tree)         86.8474    90.1757    98.5490
Random Forest      83.4522    85.4633    95.8888
Best First Tree    85.1047    89.5367    97.3398
Functional Tree    83.0243    87.3003    97.5816
Decision Table     83.1098    85.7029    92.6239
DTNB               85.5920    89.2971    97.8235
LWL(J48+KNN)       86.5906    90.1757    98.5490
Bayes Net          81.9686    86.9808    96.0097
Naïve Bayes        80.2568    84.8243    93.8331
SVM                83.8516    88.9776    97.0979
Neural Network     76.0057    81.9489    97.5816

Acknowledgments

I would like to thank Dr. Khaled Rasheed and Dr. Natarajan Kannan for their guidance in this project. I would also like to thank Eric Talevich for help in collecting the data.
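
The Discussion's observation that accuracy varies little across algorithms can be checked directly against Table I. The following is a minimal sketch, not part of the original study: it copies the reported accuracies into a dictionary and summarizes each experiment. The names `TABLE_I`, `EXPERIMENTS`, and `summarize` are hypothetical, and the pairing of values with algorithms assumes the table lists all Exp. 1 values first, then Exp. 2, then Exp. 3.

```python
from statistics import mean, pstdev

# Accuracies (%) copied from Table I, ordered as (Exp. 1, Exp. 2, Exp. 3).
# Assumption: the flattened table lists values experiment by experiment.
TABLE_I = {
    "J48 (Tree)": (86.8474, 90.1757, 98.5490),
    "Random Forest": (83.4522, 85.4633, 95.8888),
    "Best First Tree": (85.1047, 89.5367, 97.3398),
    "Functional Tree": (83.0243, 87.3003, 97.5816),
    "Decision Table": (83.1098, 85.7029, 92.6239),
    "DTNB": (85.5920, 89.2971, 97.8235),
    "LWL(J48+KNN)": (86.5906, 90.1757, 98.5490),
    "Bayes Net": (81.9686, 86.9808, 96.0097),
    "Naive Bayes": (80.2568, 84.8243, 93.8331),
    "SVM": (83.8516, 88.9776, 97.0979),
    "Neural Network": (76.0057, 81.9489, 97.5816),
}
EXPERIMENTS = ("Exp. 1", "Exp. 2", "Exp. 3")


def summarize(table):
    """Return mean, spread, range, and best algorithm(s) per experiment."""
    summary = {}
    for i, name in enumerate(EXPERIMENTS):
        column = {alg: acc[i] for alg, acc in table.items()}
        values = list(column.values())
        best = max(values)
        summary[name] = {
            "mean": mean(values),
            # Population standard deviation: the table is treated as the
            # full set of algorithms tried, not a sample of a larger set.
            "stdev": pstdev(values),
            "min": min(values),
            "max": best,
            # Ties are kept, sorted by name.
            "best": sorted(alg for alg, v in column.items() if v == best),
        }
    return summary


if __name__ == "__main__":
    for name, s in summarize(TABLE_I).items():
        print(f"{name}: mean={s['mean']:.2f}%  stdev={s['stdev']:.2f}  "
              f"range={s['min']:.2f}-{s['max']:.2f}%  best={', '.join(s['best'])}")
```

Running the script prints one summary line per experiment. The standard deviation gives a single number for the spread that Figure 2 shows visually, and the `best` field lists every algorithm tied for the top accuracy in that experiment.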