International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Special Issue, ISSN 2278-6856
National Conference on Architecture, Software systems and Green computing-2013 (NCASG2013)

Early Diagnosis of Lung Cancer using a Mining Tool

Juliet R Rajan1, Jefrin J Prakash2
1 Assistant Professor, Department of Computer Science and Engineering, Jerusalem College of Engineering, Pallikaranai, Chennai, Tamil Nadu, India
2 Project Manager, Infosys Limited, Chennai, Tamil Nadu, India

Abstract- As the amount of data grows day by day, there is a strong need to extract knowledge from it. Data mining has contributed much to this need and has found applications in fields as diverse as the stock market, banking, information technology and medicine. Data mining is the process of sifting through data and extracting the patterns that underlie it. With the growth in population and disease, there is a need to apply data mining in the health care industry. Studies have shown that cancer is one of the most widespread fatal diseases today, and lung cancer and breast cancer account for the largest share. If the disease is diagnosed at an early stage, the survival rate of the patient can be improved, but most of the time it is diagnosed at a later stage. This paper proposes a data mining methodology that could predict lung cancer at an early stage, thereby improving the patient's survival by five years. The tool pre-diagnoses lung cancer at Stage 1 and is built using Artificial Neural Networks.

Index Terms: Artificial Intelligence, Biomarkers, Clinical diagnosis, Data mining, Expert System, Pattern analysis

1. INTRODUCTION

Lung cancer is the leading cause of cancer death in both men and women. The disease is characterized by the uncontrolled growth of cells.
If it is not diagnosed and treated early, the cancer can metastasize to other parts of the body such as the brain, bones, liver and adrenal glands. According to CancerCare, a widely accepted tool for early lung cancer detection is not yet available. Current techniques such as the chest X-ray, Computed Tomography (CT) scan, sputum cytology, biopsy, bronchoscopy, needle aspiration, the electronic nose [1] and others not only require extensive infrastructure and high cost, but have also proved effective only at stage 4, when the tumor has metastasized to other parts of the body. It has also been found that 0.4% of current cancers in the US are due to CT scans performed in the past, and this may increase to as high as 1.5-2% as per the 2007 report [2]. The ionizing radiation emitted by a CT scan can cause DNA damage that the cellular repair mechanism cannot correct. Biomarker detection can also help in lung cancer detection, but lung cancer does not have any specific biomarkers and researchers are still working on that [3]. In spite of the existing techniques, most of the time lung cancer is detected only after it has passed stage 1.

As the volume of data grows in proportion to the population, there is a greater need to extract knowledge from the data. Data mining contributes much towards this and finds application in diverse fields, including the healthcare industry. Because the diagnosis of lung cancer depends heavily on previous data, data mining can be used for its early detection. Data mining tools have proved successful in disease diagnosis [4]. Data mining has already started to find application in the diagnosis of cancer, for example in cancer lesion detection [5], pulmonary nodule detection [6], classification of cancer stage from free-text histology reports [7], breath biomarker detection [8] and so on. Most of the diagnostic methods employed so far are based on image mining.
In this paper, we propose a data mining method that operates on the patients' causes and symptoms rather than on images or biomarkers. The expert system that we define here takes the various causes and symptoms of the patients as input parameters and classifies each patient into the positive or negative category of lung cancer using Artificial Neural Networks (ANN).

2. DATA MINING STEPS

Data mining is the process of finding and extracting hidden patterns of correlation among data that cannot be found by ordinary statistical methods. It is an iterative and interactive process. For patterns to be extracted successfully, a step-by-step procedure has to be followed.

2.1 Data Integration

The reports of patients suffering from lung cancer are collected from various sources and integrated. Heterogeneous reports from different health care centers tend to give better results after the mining process.

2.2 Data Cleaning

The causes and symptoms relevant to the mining process are retrieved from the heterogeneous reports, generating the dataset required for learning and testing. The dataset thus generated is likely to contain missing information, erroneous data, noise or inconsistent data. A missing value for an attribute is filled in based on domain knowledge. We discard records that have more than 40% of their values missing. In future, we can make use of a SOM-based fuzzy map model for data mining with incomplete datasets [9]. Table 1 presents some of the causes identified.
Table 1 Some Lung Cancer Causes

| Attribute | Type |
|---|---|
| Age | Numeric |
| Gender | Nominal |
| Height | Numeric |
| Weight | Numeric |
| Smoking habit | Nominal |
| Secondhand smoke | Nominal |
| Radon gas | Nominal |
| Asbestos | Nominal |
| Air pollution | Nominal |
| Radiation therapy to lungs | Nominal |
| HIV or AIDS | Nominal |
| Organ Transplant | Nominal |
| Women with HRT | Nominal |

The symptoms of the patients are classified as primary and secondary. Table 2 and Table 3 present some of the primary and secondary symptoms identified.

2.3 Data Transformation

The identified attributes have to be transformed into a form that both humans and the machine can understand. Parameters such as age, height and weight are normalized for computational efficiency using the following formula:

(1)

The attributes with nominal values are then converted into numeric or discrete variables. After normalization and discretization, the patient records are represented in the form of a matrix:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pn} \end{bmatrix} \qquad (2)$$

where p is the total number of training data and n is the number of attributes identified. The dataset is then divided into two parts: 80% of the data are used for learning and the remaining 20% for testing. Fig. 1 shows the general structure of unsupervised learning.
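The cleaning and transformation steps of Sections 2.2 and 2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: min-max scaling is an assumed normalization (the paper's formula (1) may differ), nominal values are mapped to arbitrary integer codes, and all function names and sample records are hypothetical.

```python
import random

MISSING = None  # marker used here for a missing attribute value


def drop_sparse_records(records, max_missing_fraction=0.4):
    """Discard records with more than 40% of their values missing."""
    kept = []
    for rec in records:
        missing = sum(1 for value in rec.values() if value is MISSING)
        if missing / len(rec) <= max_missing_fraction:
            kept.append(rec)
    return kept


def min_max_normalize(values):
    """Scale numeric values to [0, 1] (ASSUMED form of the normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_nominal(values):
    """Map each distinct nominal value to an integer code (sorted order)."""
    codes = {label: idx for idx, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def to_matrix(records, numeric_keys, nominal_keys):
    """Build the p x n matrix: one row per patient, one column per attribute.

    Assumes the remaining missing values have already been filled in.
    """
    columns = [min_max_normalize([r[k] for r in records]) for k in numeric_keys]
    columns += [encode_nominal([r[k] for r in records]) for k in nominal_keys]
    return [list(row) for row in zip(*columns)]


def split_80_20(rows, seed=0):
    """Shuffle the rows and return (learning set, testing set) in an 80/20 ratio."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical patient records, for illustration only.
patients = [
    {"age": 40 + i, "weight": 60 + 2 * i, "smoking": "yes" if i % 2 else "no"}
    for i in range(10)
]
matrix = to_matrix(drop_sparse_records(patients), ["age", "weight"], ["smoking"])
learning_set, testing_set = split_80_20(matrix)
```

Shuffling with a fixed seed keeps the split reproducible; whether the original split was random or sequential is not stated in the paper.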
Table 2 Some Lung Cancer Primary Symptoms

| Attribute | Type |
|---|---|
| Chest pain | Nominal |
| Cough | Nominal |
| Coughing of blood | Nominal |
| Fatigue | Nominal |
| Losing weight without trying | Nominal |
| Loss of appetite | Nominal |
| Shortness of breath | Nominal |
| Wheezing | Nominal |

Table 3 Some Lung Cancer Secondary Symptoms

| Attribute | Type |
|---|---|
| Bone pain or tenderness | Nominal |
| Eyelid drooping | Nominal |
| Facial paralysis | Nominal |
| Hoarseness or changing voice | Nominal |
| Joint pain | Nominal |
| Nail problems | Nominal |
| Shoulder pain | Nominal |
| Swallowing difficulty | Nominal |
| Swelling of face or arms | Nominal |
| Weakness | Nominal |
| Fever | Nominal |

Fig. 1 Unsupervised Learning (diagram labels: Training Set, Learning Set, h, x, Predicted y)

2.4 Data Mining

The literature survey carried out for the proposed system indicates that the Kohonen map could give high performance for the development of the tool, since it can provide statistical insights and models for larger data sets [10]. The Kohonen map is one of the powerful techniques for data mining through cluster analysis [11]. It is capable of imitating the human brain, thereby learning from past experience (or data) and then making the classification. Since this ANN follows an unsupervised learning method, it has a good chance of learning larger and more complex models. The learning can also proceed hierarchically from the observations into ever more abstract levels of representation.

First, the Euclidean distance between a patient record $i_1$ and the weight vector $w_j$ associated with each output node is calculated, where the sum runs over the n attributes:

$$d_j = \lVert i_1 - w_j \rVert = \sqrt{\sum_{k=1}^{n} \left( i_{1,k} - w_{j,k} \right)^2} \qquad (3)$$

Select the output node j* whose weight vector gives the minimum distance under (3).
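The distance computation and winner selection just described can be sketched in Python as follows. This is a minimal sketch rather than the authors' code: the function names and the sample vectors are hypothetical, and ties are resolved in favour of the lowest node index, which the paper does not specify.

```python
import math


def euclidean_distance(record, weights):
    """Euclidean distance between a patient record and one weight vector."""
    return math.sqrt(sum((x - w) ** 2 for x, w in zip(record, weights)))


def best_matching_node(record, weight_vectors):
    """Return (j_star, distance) for the output node closest to the record."""
    distances = [euclidean_distance(record, w) for w in weight_vectors]
    # min() returns the first minimum, so ties go to the lowest node index.
    j_star = min(range(len(distances)), key=distances.__getitem__)
    return j_star, distances[j_star]


# Hypothetical example: one record with three attributes, two output nodes.
record = [0.2, 1.0, 0.0]
weight_vectors = [[0.9, 0.0, 1.0], [0.1, 1.0, 0.0]]
j_star, distance = best_matching_node(record, weight_vectors)
```

With two output nodes, as in the tool described here, the index returned for a record corresponds to one of the two clusters.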
Then, update the weights of all nodes within a topological distance D(t) from j* using the weight update rule:

$$w_j(t+1) = w_j(t) + \eta(t)\,\bigl( i_1 - w_j(t) \bigr) \qquad (4)$$

This process is repeated for all the input vectors. The learning rate $\eta(t)$ generally decreases with time. An output vector of length 2 is identified, with one element for the cancer positive category and the other for the cancer negative category. Each of the p vectors in the training data is classified as falling into one of the clusters. Random weight values are assigned to the connections from the inputs to the outputs. The Euclidean distance is then calculated and the corresponding weight values are updated. This process is repeated for a number of iterations until the system has learned. The accuracy of the tool increases with the number of training data.

Fig. 2. Classification of cancer patient

After the training has been completed in the above step, the identified testing samples are given as input to the trained network. The classification is validated against the actual value and the accuracy is calculated.

2.4 Data Mining

The hidden pattern representing the knowledge in the Kohonen map is extracted by means of the resulting weight vector. The weight values, along with their corresponding causes and symptoms, are then analyzed by doctors so that the root cause of the disease can be found. The patient can then be treated according to the diagnosis result.

3. CONCLUSION

In this paper, we have proposed a method based on unsupervised learning that can be used to build a predictive model for the early detection of lung cancer. We also showed that this ANN can be used to predict the disease even when new symptoms occur. The disease can be analyzed further by extracting the resultant weight vector after the training process.

References

[1] P. Wang, X. Chen, F. Xu, D. Lu, W. Cai, K. Ying, Y. Wang and Y. Hu, "Development of Electronic Nose for Diagnosis of Lung Cancer at Early Stage", Proceedings of the 5th International Conference on Information Technology and Application in Biomedicine, Shenzhen, China, May 30-31, 2008.
[2] R. Smith-Bindman, J. Lipson, R. Marcus, et al., "Radiation dose associated with common computed tomography examinations and the associated lifetime attributable risk of cancer", Arch. Intern. Med. 169(22):2078-86, December 2009.
[3] C. Je-Yoel and S. Hye-Jin, "Proteomic approaches in lung cancer biomarker development", PubMed, 2009 Feb; 6(1):27-42. doi: 10.1586/14789450.6.1.27.
[4] S. Mai, T. Tim, S. Rob, "Using Data Mining Techniques in Heart Disease Diagnosis and Treatment", Japan-Egypt Conference on Electronics, Communications and Computers, 2012.
[5] T. Jia, Y. Wei, D. Wu, "A Lung Cancer Lesions Detection Scheme Based on CT Image", 2nd International Conference on Signal Processing Systems (ICSPS), 2012.
[6] L. Yang, Y. Jinzhu, Z. Dazhe, "A Method of Pulmonary Nodule Detection utilizing multiple Support Vector Machine", International Conference on Computer Application and System Modelling, 2010.
[7] M. Iain, M. Darren, F. Mary-Jane, "Classification of Cancer Stage from Free-text Histology Reports", Proceedings of the 28th IEEE EMBS Annual International Conference, New York City, USA, Aug 30-Sept 3, 2006.
[8] D. Siqi, H. Tianlin, S. Yang, L. Chun, H. Yuanqing, "Detection of Lung Cancer with Breath Biomarkers Based on SVM Regression", Fifth International Conference on Natural Computation, 2009.
[9] D. J. Hand, "Data mining: Statistics and more?", The American Statistician, Vol. 52, No. 2, May 1998, pp. 112-118.
[10] O. Jason, A. Syed, "Data Mining Using Self Organizing Kohonen maps: A Technique for Effective Data Clustering & Visualization", International Conference on Artificial Intelligence (IC-AI'99), June 28-July 1, 1999, Las Vegas.
[11] T. Kohonen, "Self-Organization and Associative Memory", 3rd ed., Berlin: Springer-Verlag, 1989.

ISBN NO: 978-93-80609-14-0
