Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 4 (July-August 2014) ISSN: 2250-1797 Bayesian Neural Networks with Confidence Estimations in Data Mining Approach Based On Support Vector Machines G.Sankara Subramaniam1, Dr. Ashish Chaturvedi2 Research Scholar, Department of Computer Science and Engineering, Himalayan University, Arunachal Pradesh, India, Professor and Associate Director, Arni School of Computer Science and Applications, Arni University, Indora (Kathgarh), Himachal Pradesh, India. Abstract— Genetic algorithms are inspired by Darwin's principle of evolution and genetics [1]. Probabilistic algorithms that provide a mechanism for parallel and adaptive search based on the principle of survival of the fittest and reproduction. This article introduces the intelligent computational techniques, Neural Networks, Fuzzy Logic and Expert Systems, and presents the basic principles and applications of Genetic Algorithms. Genetic Algorithm is used in optimization problems generally. In this work GA is used in classification and feature reduction in automatic disease diagnosis for cervical cancer. The aims of this research are: To investigate the use of Genetic Algorithm in developing a robust, generalizable classifier using Linear Discriminant Analysis to classify the bio-chemical parameters of cervical cancer. Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. Index Terms— machine learning, neural networks, rule extraction, comprehensible I. INTRODUCTION Genetic Algorithm provides a mechanism for handling imprecise information such as the concepts of very, little, small, tall, good, hot, cold, etc., providing an approximate answer to a question based on a knowledge that is inaccurate, incomplete or not totally reliable. Expert systems are computer programs to solve problems in a specialized field of human knowledge. Uses AI techniques, knowledge base and inferential reasoning. Techniques of Computational Intelligence have been employed successfully in the development of intelligent systems forecasting, decision support, control, optimization, modeling, classification and pattern recognition in general, applied in several sectors: Energy, Industrial, Economic, Financial, Commercial and Other, Synthesis Circuit Mode Environment, among others. [2, 6, 7, 8]. The principles of nature in which GAs are based are simple. According to C. Darwin's theory, the principle of selection favors the fittest individuals living longer and therefore most likely to play. Individuals with more offspring are more likely to perpetuate their genetic codes in the next generations. Such genetic codes constitute the identity of each individual and are represented in the chromosomes. These principles are imitated in the construction of computational algorithms that seek a better solution for a given problem through the development of populations of solutions encoded by artificial chromosomes. GAs on a chromosome is a data structure that represents a possible solution of the search space of the problem. Chromosomes are then subjected to an evolutionary process R S. Publication (rspublication.com), [email protected] Page 62 International International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 4 (July-August 2014) ISSN: 2250-1797 that involves evaluation, selection, sexual recombination (crossover) and mutation. After several cycles of evolution the population should contain fittest individuals. GAs on a chromosome is a data structure that represents a possible solution of the search space of the problem. Chromosomes are then subjected to an evolutionary process that involves evaluation, selection, sexual recombination (crossover) and mutation. After several cycles of evolution the population should contain fittest individuals. Figure 1.1 Architecture of a typical data mining system II. DIFFERENT TYPES OF DATA MINING PROBLEMS Classifications problems aim to identify the characteristics that indicate the groups to which each case belongs. This pattern can be used both to understand the existing data and to predict how new instances will behave. For example the company may want to predict whether the client will respond positively or negatively for the new lunching product. Data mining creates classification models by examining already classified data and inductively finding a predictive pattern. These existing data may come from an historical database, such as people who have already undergone a particular medical treatment. They may come from an experiment in which a sample of the entire database is tested in real world and the results used to create a classifier. Clustering divides a dataset into different groups. The goal of clustering is to find groups that are very different from each other, and whose members are very similar to each other with in the group [5]. In this clustering we do not know what the cluster will look like when we start or by which attributes the data will be clustered. After we found the clusters that reasonably segment the dataset, these clusters may then be used to classify new data. K-means and kohonen feature maps are some of the algorithms used for clustering. Segmentation refers to the general problem of identifying groups that have common properties. Clustering is a way to segment data into groups that are not previously defined, whereas classification is a way to segment data by assigning it to groups that are already found R S. Publication (rspublication.com), [email protected] Page 63 International International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 4 (July-August 2014) ISSN: 2250-1797 Fig 2.1 Data mining life cycle [2] The selection operator is an essential component of genetic algorithms. The literature identifies five key selection mechanisms: proportional, for tournaments, with truncation by linear normalization and standardization exponential [5]. A selection mechanism is characterized by selective pressure or selection intensity that it introduces the genetic algorithm. The term selective pressure is used in different contexts and with different meanings in the evolutionary computation literature. III. METHODOLOGY Genetic Algorithm is a stochastic global optimization method based on “survival of the fittest”. This is a computational technique which is used to solve problems in which there are many potential solutions, only one of which is optimal. A GA has five main parts, representation, fitness, selection, crossover and mutation. The method proceeds through generations until it is determined that the global minimum has been reached. The size of the population is usually fixed, and each member of the population contains a complete set of parameters for the function being searched. A set of candidate solutions to the problem being studied are generated at random and are grouped together into a population, with each solution being a member of this population. This randomness is important, since we do not wish to bias the solution in any way, and it is also this randomness that makes this method a truly ab initio search technique. There needs to be a way of determining the fitness of each member, i.e. some way of telling which members are better or worse solutions to the problem than the other members. It is this fitness function which defines the problem. For some problems the fitness function is quite complicated, but for others it can be quite a simple form. Population members will be chosen or selected based on their fitness, to become parents to produce offspring in a breeding procedure known as crossover. Crossover involves creating one or more offspring which are a combination of features from their parents. The selection process is quite important in preventing the system stagnating through the generations. Stagnation occurs when one good but not optimal structure will be found which biases the system away from the global minimum solution and the population becomes trapped in this local minima. The form of the fitness function is important in this, but the way that members are selected is also important and will be discussed later. Each population member needs to be encoded or represented in some way such that crossover can be performed in a systematic way. The form of this representation is very important since it can affect how stable salient features of members are during the crossover procedure. A representation that is most efficient for crossover steps may be less efficient when considering fitness calculation. R S. Publication (rspublication.com), [email protected] Page 64 International International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 4 (July-August 2014) ISSN: 2250-1797 Fig 3.1 Full Tree Diagram for Training Data The full classification tree contains 4 levels starting from the level 0 to 3. The terminal nodes for class 1 and class 0 are in the last level-3. The decision nodes are located in the in between levels level 0 to 2. The level 0 had one decision node. It used the spilt variable called insulin (2-Hour serum insulin). It made the first split partition. The insulin split variable split value is 142. In the first level there are two decision nodes namely insulin and). The second level have 4 decision nodes namely Pg (Plasma glucose level), age and dbp (Diastolic blood pressure) IV. RESULT Data mining applications depends on the datasets to supply the raw input. If the dataset is dynamic, incomplete, noisy and large then it may create more problems to the data mining application. The difficulties in data mining can be grouped in as below. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in a same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity). Clustering is similar to classification, but classes are not predefined and it is up to the clustering algorithm to discover acceptable classes. Often, it is necessary to modify the clustering by excluding variables that have been employed to group instances because, upon examination, the user identifies them as irrelevant or not meaningful. After clusters are found that reasonably segment the database, these clusters are then used to classify new data. Some of the common algorithms used to perform clustering include Kohonen feature maps and K-means. Clustering is different from segmentation. Segmentation refers to the general problem of identifying groups that have common characteristics whereas clustering is a way to segment data into groups that are not previously defined. Clustering studies are also referred to as unsupervised learning. Unsupervised learning is a process of classification with an unknown target, that is, the class of each case is unknown. The aim is to segment the cases into disjoint classes that are homogenous with respect to the inputs. Clustering studies have no dependent variables. Clustering is one of the most useful tasks in data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. Clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than points in different clusters. R S. Publication (rspublication.com), [email protected] Page 65 International International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 4 (July-August 2014) ISSN: 2250-1797 Fig 4 KNN Training Error Curve for Different K Values In data mining, a decision tree is a predictive model which can be used to represent both classifiers and regression models. Decision tree algorithm is a data mining induction technique that recursively partitions a dataset of records using either depth-first greedy approach or breadth-first approach until all the data items belong to a particular class (Han and Kamber 2008). Decision trees constitute a way of representing a series of rules that lead to a class or value and it is a powerful and popular tool for classification and prediction. Decision trees are also useful for exploring data to gain insight into the relationships of a large number of candidate input variables to the target variable. Figure 5 Created Neural Network Classification System Output The performance of the classifier depends on the dataset it is used for training and testing. Every machine learning algorithm has its own merits and demerits. There is no single machine learning algorithm which is going to give the best result for all the type of datasets. Data mining in medical field is a challenging task because of the complexity in the medical domain. In this research Hybrid classifiers system gave the reliability and performance which is the two top most expected priorities in the medical diagnosis task. V. DISCUSSION AND CONCLUSION Data mining searches databases to find hidden patterns and predict information to increase the business in the organization. Data mining is the nontrivial extraction of implicit, previously unknown, interesting and potentially useful information from data. R S. Publication (rspublication.com), [email protected] Page 66 International International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 4 (July-August 2014) ISSN: 2250-1797 The successful data mining applications in the fields of e- business, marketing and retail led the other sectors and industries like health care to use the data mining techniques to improve its performance. Data Mining for Biological and Environment problems is one of the ten challenging problems what data mining industry is facing today. Many researchers believe that mining biological data continues to be an extremely important problem, both for data mining research and for biomedical sciences. We have selected the Pima Indian Diabetes dataset for our research because it’s already used by the many machine learning researchers to test their algorithms classification performances. The dataset is more reliable to use because lot of researchers used the dataset and published lot of research papers based on it. The data set is very difficult to classify it gave 26.20% as miss classification error rate one of the reason is it has lot of missing values in it. The Genetic Algorithm is used for prediction and the feature reduction successfully. The GUI used for predicting the cervical cancer through the given biochemical parameters. The percentage of risk is shown to the user by prediction with the known dataset. Though the medical diagnosis is regarded as an important yet complicated task that needs to be executed accurately and efficiently. The automation of such systems would be extremely advantageous. Regrettably all doctors do not possess expertise in every sub specialty and moreover there is a shortage of resource persons at certain places. Therefore, an automatic medical diagnosis system would probably be exceedingly beneficial by bringing all of them together. This system uses Genetic algorithm which provides optimal solution even if the problem has a bigger source of data. This predicts cervical cancer through the biochemical parameters. Otherwise the diagnosis usually requires the histological examination of a tissue biopsy specimen by a pathologist, although the initial indication of malignancy can be symptomatic or radiographic imaging abnormalities which is a painful process. Predicting the unknown data has more vital role in data analysis and retrieval. REFERENCES [1] García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10), 2044-2064. [2] Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of Healthcare Information Management—Vol, 19(2), 65. [3] Romero, C., & Ventura, S. (2010). Educational data mining: a review of the state of the art. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 40(6), 601-618. [4] Geppert, H., Vogt, M., & Bajorath, J. (2010). Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. Journal of chemical information and modeling, 50(2), 205-216. [5] Thelwall, M., Wilkinson, D., & Uppal, S. (2010). Data mining emotion in social network communication: Gender differences in MySpace. Journal of the American Society for Information Science and Technology, 61(1), 190-199. R S. Publication (rspublication.com), [email protected] Page 67 International International Journal of Computer Application Available online on http://www.rspublication.com/ijca/ijca_index.htm Issue 4, Volume 4 (July-August 2014) ISSN: 2250-1797 [6] Friedman, A., & Schuster, A. (2010, July). Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 493-502). ACM. [7] Kumar, V., & Chadha, A. (2011). An Empirical Study of the Applications of Data Mining Techniques in Higher Education. International Journal of Advanced Computer Science and Applications, 2(3), 80-84. [8] Liu, H., Motoda, H., Setiono, R., & Zhao, Z. (2010). Feature Selection: An Ever Evolving Frontier in Data Mining. Journal of Machine Learning ResearchProceedings Track, 10, 4-13. [9] Ball, N. M., & Brunner, R. J. (2010). Data mining and machine learning in astronomy. International Journal of Modern Physics D, 19(07), 1049-1106. [10] Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3), 559-569. [11] Duan, L., Street, W. N., & Xu, E. (2011). Healthcare information systems: data mining methods in the creation of a clinical recommender system. Enterprise Information Systems, 5(2), 169-181. [12] Linoff, G. S., & Berry, M. J. (2011). Data mining techniques: for marketing, sales, and customer relationship management. John Wiley & Sons. [13] Mohammed, N., Chen, R., Fung, B., & Yu, P. S. (2011, August). Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 493501). ACM. [14] Alavi, A. H., & Gandomi, A. H. (2011). A robust data mining approach for formulation of geotechnical engineering systems. Engineering Computations,28(3), 242-274. [15] Helbing, D., & Balietti, S. (2011). From social data mining to forecasting socio-economic crises. The European Physical Journal Special Topics, 195(1), 3-68. R S. Publication (rspublication.com), [email protected] Page 68