International Journal of Computer Application
Available online on http://www.rspublication.com/ijca/ijca_index.htm
Issue 4, Volume 4 (July-August 2014)
ISSN: 2250-1797
Bayesian Neural Networks with Confidence Estimations in Data
Mining Approach Based On Support Vector Machines
G. Sankara Subramaniam (1), Dr. Ashish Chaturvedi (2)
(1) Research Scholar, Department of Computer Science and Engineering,
Himalayan University, Arunachal Pradesh, India
(2) Professor and Associate Director, Arni School of Computer Science and Applications,
Arni University, Indora (Kathgarh), Himachal Pradesh, India
Abstract— Genetic algorithms are inspired by Darwin's principle of evolution and genetics [1].
They are probabilistic algorithms that provide a mechanism for parallel and adaptive search based
on the principle of survival of the fittest and reproduction. This article introduces the
intelligent computational techniques of Neural Networks, Fuzzy Logic and Expert Systems, and
presents the basic principles and applications of Genetic Algorithms (GAs). GAs are generally
used in optimization problems. In this work a GA is used for classification and feature
reduction in automatic disease diagnosis for cervical cancer. The aim of this research is to
investigate the use of a Genetic Algorithm in developing a robust, generalizable classifier,
using Linear Discriminant Analysis, to classify the biochemical parameters of cervical cancer.
Data mining is a process that uses a variety of data analysis tools to discover patterns and
relationships in data that may be used to make valid predictions.
Index Terms— machine learning, neural networks, rule extraction, comprehensible
I. INTRODUCTION
Fuzzy Logic provides a mechanism for handling imprecise information such as the
concepts of very, little, small, tall, good, hot, cold, etc., providing an approximate answer to a
question based on knowledge that is inaccurate, incomplete or not totally reliable. Expert
systems are computer programs that solve problems in a specialized field of human knowledge.
They use AI techniques, a knowledge base and inferential reasoning. Techniques of Computational
Intelligence have been employed successfully in the development of intelligent systems for
forecasting, decision support, control, optimization, modeling, classification and pattern
recognition in general, applied in several sectors: Energy, Industrial, Economic, Financial,
Commercial, Circuit Synthesis and Environment, among others [2, 6, 7, 8]. The
principles of nature on which GAs are based are simple.
According to C. Darwin's theory, the principle of selection favors the fittest individuals,
which live longer and are therefore more likely to reproduce. Individuals with more offspring are more likely
to perpetuate their genetic codes in the next generations. Such genetic codes constitute the
identity of each individual and are represented in the chromosomes. These principles are
imitated in the construction of computational algorithms that seek a better solution for a given
problem through the development of populations of solutions encoded by artificial
chromosomes. In a GA, a chromosome is a data structure that represents a possible solution in
the search space of the problem. Chromosomes are then subjected to an evolutionary process
that involves evaluation, selection, sexual recombination (crossover) and mutation. After
several cycles of evolution the population should contain fitter individuals.
R S. Publication (rspublication.com), [email protected]
Figure 1.1 Architecture of a typical data mining system
II. DIFFERENT TYPES OF DATA MINING PROBLEMS
Classification problems aim to identify the characteristics that indicate the group to
which each case belongs. This pattern can be used both to understand the existing data and to
predict how new instances will behave. For example, a company may want to predict
whether a client will respond positively or negatively to a newly launched product. Data
mining creates classification models by examining already classified data and
inductively finding a predictive pattern. These existing data may come from a historical
database, such as people who have already undergone a particular medical treatment. They
may also come from an experiment in which a sample of the entire database is tested in the real
world and the results are used to create a classifier. Clustering divides a dataset into different
groups. The goal of clustering is to find groups that are very different from each other, and whose
members are very similar to each other within the group [5]. In clustering we do not
know what the clusters will look like when we start, or by which attributes the data will be
clustered. After we have found clusters that reasonably segment the dataset, these clusters may
then be used to classify new data. K-means and Kohonen feature maps are some of the
algorithms used for clustering. Segmentation refers to the general problem of identifying
groups that have common properties. Clustering is a way to segment data into groups that
are not previously defined, whereas classification is a way to segment data by assigning
it to groups that are already defined.
Fig 2.1 Data mining life cycle [2]
The selection operator is an essential component of genetic algorithms. The literature
identifies five key selection mechanisms: proportional, tournament, truncation, linear
normalization and exponential normalization [5]. A selection mechanism is
characterized by the selective pressure, or selection intensity, that it introduces into the
genetic algorithm. The term selective pressure is used in different contexts and with different
meanings in the evolutionary computation literature.
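Two of the mechanisms named above, proportional and tournament selection, can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the tournament size k and the assumption that larger fitness values are better are all hypothetical choices made for the example.

```python
import random

def proportional_selection(population, fitness, rng):
    """Roulette-wheel selection: pick one member with probability
    proportional to its non-negative fitness (assumed: larger is better)."""
    total = sum(fitness)
    if total == 0:
        return rng.choice(population)  # no information: choose uniformly
    threshold = rng.uniform(0, total)
    running = 0.0
    for member, score in zip(population, fitness):
        running += score
        if threshold < running:
            return member
    return population[-1]  # guard against floating-point round-off

def tournament_selection(population, fitness, rng, k=3):
    """Tournament selection: draw k distinct members at random
    and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]
```

In this sketch the tournament size k acts as a simple knob for selective pressure: with k equal to the population size the fittest member is always returned, while k = 1 reduces to uniform random choice.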
III. METHODOLOGY
Genetic Algorithm is a stochastic global optimization method based on “survival of the
fittest”. This is a computational technique which is used to solve problems in which there are
many potential solutions, only one of which is optimal. A GA has five main parts:
representation, fitness, selection, crossover and mutation. The method proceeds through
generations until it is determined that the global minimum has been reached. The size of the
population is usually fixed, and each member of the population contains a complete set of
parameters for the function being searched. A set of candidate solutions to the problem being
studied is generated at random and grouped together into a population, with each solution
being a member of this population. This randomness is important, since we do not wish to bias
the solution in any way, and it is also this randomness that makes this method a truly ab initio
search technique. There needs to be a way of determining the fitness of each member, i.e.
some way of telling which members are better or worse solutions to the problem than the other
members. It is this fitness function which defines the problem. For some problems the fitness
function is quite complicated, but for others it can be quite a simple form. Population members
will be chosen or selected based on their fitness, to become parents to produce offspring in a
breeding procedure known as crossover. Crossover involves creating one or more offspring
which are a combination of features from their parents. The selection process is quite
important in preventing the system from stagnating through the generations. Stagnation occurs
when a good but not optimal structure is found, which biases the system away from the
global minimum solution, and the population becomes trapped in this local minimum. The form
of the fitness function is important in this, but the way that members are selected is also
important and will be discussed later. Each population member needs to be encoded or
represented in some way such that crossover can be performed in a systematic way. The form
of this representation is very important since it can affect how stable salient features of
members are during the crossover procedure. A representation that is most efficient for
crossover steps may be less efficient when considering fitness calculation.
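The five parts described above can be put together in a short sketch. It is only an illustration under stated assumptions, not the system built in this work: the sum-of-squares objective, the real-valued list representation, one-point crossover, Gaussian mutation, elitism, the fixed generation count and all parameter values are hypothetical choices.

```python
import random

def fitness(chromosome):
    """Hypothetical objective: sum of squares, global minimum 0 at the origin."""
    return sum(x * x for x in chromosome)

def tournament(population, rng, k=3):
    """Selection: the lowest-cost member among k randomly drawn members."""
    return min(rng.sample(population, k), key=fitness)

def crossover(parent_a, parent_b, rng):
    """One-point crossover: prefix of one parent, suffix of the other."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome, rng, rate=0.1, scale=0.3):
    """Mutation: perturb each gene with probability `rate` by Gaussian noise."""
    return [x + rng.gauss(0, scale) if rng.random() < rate else x
            for x in chromosome]

def run_ga(n_genes=4, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    # Representation: each member is a complete set of real-valued
    # parameters, generated at random so the search is not biased.
    population = [[rng.uniform(-5, 5) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism (an assumption of this sketch): carry the best member over.
        children = [min(population, key=fitness)]
        while len(children) < pop_size:  # population size stays fixed
            child = crossover(tournament(population, rng),
                              tournament(population, rng), rng)
            children.append(mutate(child, rng))
        population = children
    return min(population, key=fitness)
```

A call such as `run_ga(seed=1)` returns the best chromosome found. The sketch stops after a fixed number of generations because, in general, a GA cannot certify that the global minimum has been reached; mutation and the tournament size are the two places where the risk of stagnation in a local minimum can be tuned.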
Fig 3.1 Full Tree Diagram for Training Data
The full classification tree contains 4 levels, numbered 0 to 3. The
terminal nodes for class 1 and class 0 are in the last level, level 3. The decision nodes are
located in levels 0 to 2. Level 0 has one decision node, which uses the split variable
insulin (2-Hour serum insulin) to make the first partition, with a split value of 142.
In the first level there are two decision nodes, namely insulin and [...]. The
second level has 4 decision nodes, which use Pg (Plasma glucose level), age and dbp (Diastolic
blood pressure).
IV. RESULT
Data mining applications depend on the datasets to supply the raw input. If the
dataset is dynamic, incomplete, noisy and large, then it may create more problems for the data
mining application. The difficulties in data mining can be grouped as below.
There are many clustering approaches, all based on the principle of maximizing the
similarity between objects in the same class (intra-class similarity) and minimizing the similarity
between objects of different classes (inter-class similarity). Clustering is similar to
classification, but classes are not predefined and it is up to the clustering algorithm to discover
acceptable classes. Often, it is necessary to modify the clustering by excluding variables that
have been employed to group instances because, upon examination, the user identifies them as
irrelevant or not meaningful. After clusters are found that reasonably segment the database,
these clusters are then used to classify new data. Some of the common algorithms used to
perform clustering include Kohonen feature maps and K-means. Clustering is different from
segmentation. Segmentation refers to the general problem of identifying groups that have
common characteristics whereas clustering is a way to segment data into groups that are not
previously defined. Clustering studies are also referred to as unsupervised learning.
Unsupervised learning is a process of classification with an unknown target, that is, the class
of each case is unknown. The aim is to segment the cases into disjoint classes that are
homogeneous with respect to the inputs. Clustering studies have no dependent variables.
Clustering is one of the most useful tasks in the data mining process for discovering groups and
identifying interesting distributions and patterns in the underlying data. The clustering problem
is about partitioning a given data set into groups (clusters) such that the data points in a
cluster are more similar to each other than points in different clusters.
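As an illustration of the partitioning idea, K-means, one of the algorithms named above, can be sketched in plain Python. This is a minimal sketch under assumptions: points are tuples of numbers, similarity is measured by squared Euclidean distance, the initial centroids are k randomly chosen data points, and the function names and iteration count are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=20, seed=0):
    """Partition `points` into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids
```

No class labels are supplied anywhere in this sketch, which is what makes the procedure unsupervised: the groups emerge only from the distances between the inputs.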
Fig 4 KNN Training Error Curve for Different K Values
In data mining, a decision tree is a predictive model which can be used to represent both
classifiers and regression models. The decision tree algorithm is a data mining induction
technique that recursively partitions a dataset of records using either a depth-first greedy
approach or a breadth-first approach until all the data items belong to a particular class (Han
and Kamber 2008). Decision trees constitute a way of representing a series of rules that lead to
a class or value, and they are a powerful and popular tool for classification and prediction.
Decision trees are
also useful for exploring data to gain insight into the relationships of a large number of
candidate input variables to the target variable.
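A single step of this recursive partitioning, choosing the threshold at which to split one numeric variable, can be sketched as follows. The sketch assumes binary 0/1 class labels and uses Gini impurity as the split criterion; the criterion actually used in this work is not stated, so that choice and the function names are assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best rule
    'value <= threshold', or (None, impurity) if no split helps."""
    best = (None, gini(labels))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best
```

A full tree would apply `best_split` to every candidate variable, keep the variable with the lowest weighted impurity, and then repeat the procedure on the left and right subsets until each subset belongs to a single class.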
Figure 5 Created Neural Network Classification System Output
The performance of a classifier depends on the dataset used for training and testing.
Every machine learning algorithm has its own merits and demerits. There is no single machine
learning algorithm that gives the best result for all types of datasets. Data
mining in the medical field is a challenging task because of the complexity of the medical domain.
In this research the hybrid classifier system delivered reliability and performance, which are
the two highest priorities in the medical diagnosis task.
V. DISCUSSION AND CONCLUSION
Data mining searches databases to find hidden patterns and predictive information
that can increase an organization's business. Data mining is the nontrivial extraction of
implicit, previously unknown, interesting and potentially useful information from data.
The successful data mining applications in the fields of e-business, marketing and retail have
led other sectors and industries, such as health care, to use data mining techniques to
improve their performance. Data Mining for Biological and Environmental problems is one of
the ten challenging problems that the data mining industry is facing today. Many researchers
believe that mining biological data continues to be an extremely important problem, both for
data mining research and for biomedical sciences.
We have selected the Pima Indian Diabetes dataset for our research because it has already
been used by many machine learning researchers to test the classification performance of their
algorithms. The dataset is reliable to use because many researchers have used it
and published research papers based on it. The data set is very difficult to classify: it gave a
misclassification error rate of 26.20%, and one of the reasons is that it has many missing values.
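For reference, a misclassification error rate of this kind is the fraction of cases whose predicted class differs from the actual class. The helper below is a minimal sketch of that arithmetic; the function name and the toy labels in the usage note are hypothetical and are not taken from the experiments reported here.

```python
def misclassification_rate(actual, predicted):
    """Fraction of cases whose predicted class differs from the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)
```

For example, with actual classes `[1, 0, 1, 1, 0, 0, 1, 0]` and predictions `[1, 0, 0, 1, 0, 1, 1, 0]`, two of the eight cases are wrong, so the error rate is 0.25, i.e. 25%.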
The Genetic Algorithm is used successfully for prediction and feature reduction. The
GUI is used for predicting cervical cancer from the given biochemical parameters. The
percentage of risk is shown to the user by prediction with the known dataset.
Medical diagnosis is regarded as an important yet complicated task that needs to be executed
accurately and efficiently, so the automation of such systems would be extremely advantageous.
Regrettably, not all doctors possess expertise in every subspecialty, and moreover there is
a shortage of resource persons in certain places. Therefore, an automatic medical
diagnosis system would probably be exceedingly beneficial by bringing all of them
together. This system uses a Genetic Algorithm, which can provide a near-optimal solution even
if the problem has a large source of data. It predicts cervical cancer from the biochemical
parameters. Otherwise the diagnosis usually requires the histological examination of a tissue
biopsy specimen by a pathologist, which is a painful process, although the initial indication of
malignancy can be symptoms or radiographic imaging abnormalities. Predicting
unknown data has a vital role in data analysis and retrieval.
REFERENCES
[1] García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric
tests for multiple comparisons in the design of experiments in computational
intelligence and data mining: Experimental analysis of power. Information Sciences,
180(10), 2044-2064.
[2] Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of
Healthcare Information Management, 19(2), 65.
[3] Romero, C., & Ventura, S. (2010). Educational data mining: a review of the state of
the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and
Reviews, 40(6), 601-618.
[4] Geppert, H., Vogt, M., & Bajorath, J. (2010). Current trends in ligand-based virtual
screening: molecular representations, data mining methods, new application areas,
and performance evaluation. Journal of chemical information and modeling, 50(2),
205-216.
[5] Thelwall, M., Wilkinson, D., & Uppal, S. (2010). Data mining emotion in social
network communication: Gender differences in MySpace. Journal of the American
Society for Information Science and Technology, 61(1), 190-199.
[6] Friedman, A., & Schuster, A. (2010, July). Data mining with differential privacy. In
Proceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining (pp. 493-502). ACM.
[7] Kumar, V., & Chadha, A. (2011). An Empirical Study of the Applications of Data
Mining Techniques in Higher Education. International Journal of Advanced Computer
Science and Applications, 2(3), 80-84.
[8] Liu, H., Motoda, H., Setiono, R., & Zhao, Z. (2010). Feature Selection: An Ever
Evolving Frontier in Data Mining. Journal of Machine Learning Research - Proceedings Track,
10, 4-13.
[9] Ball, N. M., & Brunner, R. J. (2010). Data mining and machine learning in
astronomy. International Journal of Modern Physics D, 19(07), 1049-1106.
[10] Ngai, E. W. T., Hu, Y., Wong, Y. H., Chen, Y., & Sun, X. (2011). The
application of data mining techniques in financial fraud detection: A classification
framework and an academic review of literature. Decision Support Systems, 50(3),
559-569.
[11] Duan, L., Street, W. N., & Xu, E. (2011). Healthcare information systems:
data mining methods in the creation of a clinical recommender system. Enterprise
Information Systems, 5(2), 169-181.
[12] Linoff, G. S., & Berry, M. J. (2011). Data mining techniques: for marketing,
sales, and customer relationship management. John Wiley & Sons.
[13] Mohammed, N., Chen, R., Fung, B., & Yu, P. S. (2011, August).
Differentially private data release for data mining. In Proceedings of the 17th ACM
SIGKDD international conference on Knowledge discovery and data mining (pp. 493-501). ACM.
[14] Alavi, A. H., & Gandomi, A. H. (2011). A robust data mining approach for
formulation of geotechnical engineering systems. Engineering Computations, 28(3),
242-274.
[15] Helbing, D., & Balietti, S. (2011). From social data mining to forecasting
socio-economic crises. The European Physical Journal Special Topics, 195(1), 3-68.