International Journal of Research in Computer and Communication Technology, Vol 3, Issue 4, April 2014. ISSN (Online) 2278-5841, ISSN (Print) 2320-5156
Classification Methods in Data Mining: A Detailed Survey
Mariammal.D¹, Jayanthi.S² and Dr.P.S.K.Patra³

¹ Department of CSE, Agni College of Technology, Thalambur, Chennai, Tamil Nadu, India
[email protected]

² Asst. Prof., Department of CSE, Agni College of Technology, Thalambur, Chennai, Tamil Nadu, India
[email protected]

³ HOD, Department of CSE, Agni College of Technology, Thalambur, Chennai, Tamil Nadu, India
[email protected]
Abstract

Classification is a data mining (machine learning) technique and a model-finding process used for assigning data into different classes according to specific constraints. In other words, classification is used to predict group membership for data instances. There are several major kinds of classification algorithms, including genetic algorithms, C4.5, Naïve Bayes, SVM, KNN, decision trees, neural networks and CART. The goal of this survey is to provide a detailed review of two classification techniques and their applications in various emerging fields. This paper also presents a comparison of these widely used techniques.
Index Terms— Classification, Neural Network, Decision Tree
1. Introduction

Data mining is the process of extracting interesting, nontrivial, implicit and previously unknown patterns from huge volumes of information repositories such as relational databases, data warehouses and transactional databases. Data mining is also known as a central part of Knowledge Discovery in Databases (KDD), and it is very useful for collecting and managing huge volumes of data. Beyond collecting and managing data, data mining (DM) also includes analysis and prediction. Classification techniques in data mining are capable of processing large amounts of data. A classifier can predict categorical class labels and classifies data based on a training set and class labels, and hence can be used for classifying newly available data. Classification can thus be regarded as an indispensable part of data mining and is gaining popularity. In the present paper a detailed study of two classification techniques and their applications is made. Section II describes decision trees, Section III deals with neural networks, Section IV compares the methods, and the final section concludes the paper.
II. DECISION TREE

A decision tree is an analytical model that can be used to represent both classifiers and regression models. Viewed more generally, a decision tree is a hierarchical model of decisions and their consequences, and a decision maker employs decision trees to identify the strategy most likely to reach a goal. When a decision tree is used for classification it is referred to as a classification tree; when it is used for regression it is referred to as a regression tree. In this paper we concentrate mainly on classification. Classification trees are used to classify an object or an instance into a predefined set of classes based on its attributes. The classification tree is useful as an investigative technique; however, it does not attempt to replace existing traditional statistical methods, and there are many other techniques that can be used to classify or predict the membership of instances in a predefined set of classes. A decision tree can also be used, for example, to analyze the repayment behavior of customers who received a credit.

Fig 2: Decision tree for providing a loan

The decision tree is one of the most popular techniques in data mining. Many researchers consider decision trees popular because of their simplicity and transparency: decision trees are easy to understand and do not require any domain knowledge, and they are generally represented graphically as hierarchical structures, making them easy to interpret.

2.1 Characteristics of Classification Trees:

A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree is a directed tree with a node called the “root” which has no incoming edges; all other nodes have exactly one incoming edge. A node with outgoing edges is called an “internal” or “test” node, and all other nodes are referred to as “leaves” (also known as decision nodes). Each internal node in a decision tree splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. Each leaf is assigned to one class representing the most appropriate target value; alternatively, the leaf may hold a probability vector (affinity vector) indicating the probability of the target attribute having each value. Internal nodes are conventionally drawn as circles, whereas leaves are drawn as triangles. Two or more branches may grow from each internal node (i.e., a node that is not a leaf). Each internal node corresponds to a certain attribute, and its branches correspond to specific ranges of values; these ranges must give a partition of the set of values of the given attribute. Instances are classified by navigating from the root of the tree down to a leaf according to the outcome of the tests along the path, as illustrated in the sketch below. Decision trees accommodate both nominal and numeric attributes.
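As a minimal illustration of this root-to-leaf navigation, the following Python sketch stores a hypothetical loan-decision tree (in the spirit of Fig 2) as nested dicts; the attribute names, branch values, and class labels are invented for illustration and are not taken from the paper.

# Hypothetical tree: each internal node maps one attribute name to a dict of
# {branch value: subtree}; each leaf is a bare class label.
tree = {"income": {"high": "approve",
                   "low": {"has_collateral": {"yes": "approve",
                                              "no": "reject"}}}}

def classify(tree, instance):
    """Navigate from the root down to a leaf using the instance's attribute values."""
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))                 # attribute tested at this node
        node = node[attribute][instance[attribute]]  # follow the matching branch
    return node                                      # a leaf: the predicted class

print(classify(tree, {"income": "low", "has_collateral": "yes"}))  # -> approve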
2.2 Constructing decision trees

Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees. Decision tree programs construct a decision tree T from a set of training cases; the ID3 algorithm below is a representative example.
function ID3
Input: (R: a set of non-target attributes,
        C: the target attribute,
        S: a training set) returns a decision tree;
begin
  If S is empty, return a single node with value Failure;
  If all records in S have the same value for the target
    attribute, return a single leaf node with that value;
  If R is empty, then return a single node with the most
    frequent of the values of the target attribute found in
    records of S [in that case there may be errors, i.e.
    examples that will be improperly classified];
  Let A be the attribute with largest Gain(A, S) among
    attributes in R;
  Let {aj | j = 1, 2, .., m} be the values of attribute A;
  Let {Sj | j = 1, 2, .., m} be the subsets of S consisting
    respectively of records with value aj for A;
  Return a tree with root labeled A and arcs labeled
    a1, a2, .., am going respectively to the trees
    ID3(R-{A}, C, S1), ID3(R-{A}, C, S2), .., ID3(R-{A}, C, Sm)
    [ID3 is applied recursively to each subset Sj until one of
    the stopping conditions above is met];
end
Figure 2: ID3 Decision Tree Algorithm
ID3 searches through the attributes of the training instances and extracts the attribute that best separates the given examples. If the attribute perfectly classifies the training set then ID3 stops; otherwise it recursively operates on the m partitioned subsets (where m is the number of possible values of the attribute) to find their best attribute. The algorithm uses a greedy search: it picks the best attribute and never looks back to reconsider earlier choices, so ID3 may misclassify data. The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree, namely the attribute with the largest information gain Gain(A, S) = H(S) - Σj (|Sj|/|S|) H(Sj), where H(S) is the entropy of the class labels in S.
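The procedure above can be made concrete with a short Python sketch. It uses the entropy-based information gain just described and returns trees in the same nested-dict form as the navigation sketch in Section 2.1; the function names and the toy data at the end are illustrative assumptions, not from the paper.

from collections import Counter
import math

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(records, labels, attr):
    """Information gain Gain(attr, S) = H(S) - sum_j |Sj|/|S| * H(Sj)."""
    remainder = 0.0
    for value in set(r[attr] for r in records):
        subset = [l for r, l in zip(records, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def id3(records, labels, attributes):
    """records: list of dicts mapping attribute -> value; labels: target values."""
    if not records:
        return "Failure"                  # S is empty
    if len(set(labels)) == 1:
        return labels[0]                  # all records share one target value
    if not attributes:                    # R is empty: majority vote (may err)
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(records, labels, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in records):
        idx = [i for i, r in enumerate(records) if r[best] == value]
        tree[best][value] = id3([records[i] for i in idx],
                                [labels[i] for i in idx], remaining)
    return tree

# Toy usage with invented weather-style data:
data = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rainy", "windy": "no"}]
labels = ["play", "stay", "play"]
print(id3(data, labels, ["outlook", "windy"]))  # -> {'windy': {'no': 'play', 'yes': 'stay'}}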
2.3 Advantages of the Decision Tree Approach:

The advantages of decision tree classifiers over traditional statistical classifiers include their simplicity, their ability to handle missing and noisy data, and their non-parametric nature. Decision trees are not constrained by any lack of knowledge of the class distributions, can be trained quickly, and require little computational time.
2.4 Applications of Decision Trees:

This section shows some recent successes in applying decision tree learning to solve real-world problems.

1. Predicting Library Book Use:
Decision trees have been developed that predict the future use of books in a library. Forecasting book usage helps librarians to select low-usage titles and move them to relatively distant and less expensive off-site locations that use efficient compact storage techniques. For this, it is important to adopt a book-choice strategy that minimizes the expected frequency of requests for removed titles. For any choice policy, this frequency depends on the percentage of titles that have to be removed for off-site storage; the higher the percentage, the higher this frequency is expected to be.
2. Exploring the Relationship Between the Research Octane Number and Molecular Substructures:
Figuring out what molecular information one needs to predict the research octane number (RON) is a nontrivial problem of particular interest to chemists. In the work of Blurock, substructure presence/absence information is used for RON prediction, not only because this is believed to give good prediction results, but also because asking directly about the presence or absence of substructures in molecules is easily interpretable by chemists, so valuable intuitive information can be gained by studying the substructure-RON relationship. In addition to demonstrating the predictive power of the learned decision trees, analyzing these trees was useful in providing insight into the significance of different substructures for RON prediction. These findings are viewed as a contribution to a better understanding of the underlying principles that determine the RON of molecules.
3. Characterization of Leiomyomatous Tumors:
The goal was to generate hypotheses about tumor diagnosis/prognosis problems when confronted with a large number of features. For a given tumor, it is desired to know to which group the tumor belongs and why. Traditionally, tumor characterization is made on the basis of features that are difficult for a pathologist to evaluate; the job is thus carried out subjectively, and the quality of the results is determined by the pathologist's experience with the group of tumors concerned. To achieve a higher level of objectivity, many more quantitative measurements (related to DNA content, morphonuclear characteristics, and immunohistochemical specificities) need to be considered. Furthermore, useful information can result from interactions between several of these features that cannot be detected using traditional univariate statistical analysis. In the work of Decaestecker et al., decision tree learning was applied to the difficult problem of leiomyomatous (smooth muscle) tumor diagnosis. The authors note that the decision tree approach is well suited to this task because it leads to explicit logical rules that can be interpreted by human experts, which matches the exploratory nature of their job.
4. Star/Cosmic-Ray Classification in Hubble Space Telescope Images:
Salzberg et al. applied decision tree learning to the task of distinguishing between stars and cosmic rays in images collected by the Hubble Space Telescope. In addition to high accuracy, a classifier for this task must be fast, due to the large number of classifications required and the need for online classification. In their experiments, a set of 2211 preclassified images was used as a training sample for decision tree construction, and a separate set of 2282 preclassified images was used to measure the generalization performance of the learned decision tree. Each image was described using 20 numerical features and labeled as either a star or a cosmic ray. The reported experiments show that quite compact decision trees (no more than 9 nodes) achieve generalization accuracy of over 95%. Moreover, the experiments suggest that this accuracy gets even higher when methods for eliminating background noise are employed.
III. NEURAL NETWORK

An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological nervous systems, such as the brain, process information. The key element of the ANN paradigm is the novel structure of the information processing system: it is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems [1]. A neural network is configured for a specific application, such as pattern recognition or data classification, through a learning process. The learning process in biological systems involves adjustments to the synaptic connections that exist between the neurons. A neural network is a powerful data modeling tool that is able to capture and represent complex input/output relationships. The enthusiasm for the development of neural network technology stemmed from the desire to develop an artificial system that could perform "intelligent" tasks similar to those performed by the human brain. ANNs resemble the human brain in the following two ways: they acquire knowledge through a learning process, and the knowledge is stored within inter-neuron connection strengths known as synaptic weights [1, 2]. The true power and advantage of neural networks lies in their ability to represent both linear and non-linear relationships and in their ability to learn these relationships directly from the data being modeled.
3.1 Characteristics of neural networks:

There are four main characteristics of ANN technology: the network structure, the parallel processing ability, the distributed memory, and the fault tolerance ability.

1. Network structure: An ANN may have either a recurrent or a non-recurrent structure. A recurrent network [4, 5] is a feedback network in which the network calculates its outputs based on the inputs and feeds them back to modify the inputs. For a stable recurrent network, this process normally produces smaller and smaller output changes until the outputs become constant.

2. Parallel processing ability: Each neuron in the ANN is a processing element similar to a Boolean logic unit in a conventional computer chip, except that a neuron's function is programmable. The computations required to simulate ANNs are mainly matrix operations, and the parallel structure of the interconnections between neurons facilitates such calculations.

3. Distributed memory: The network does not store information in a central memory; information is stored as patterns throughout the network structure. The state of the neurons represents a short-term memory, as it may change with the next input vector. The values in the weight matrix form a long-term memory and are changeable only on a longer time scale.

4. Fault tolerance ability: The network's parallel processing ability and distributed memory make it relatively fault tolerant. In a neural computer, the failure of one or more parts may degrade the accuracy but does not break the system; a system failure occurs only when all parts fail at the same time. This provides a measure of damage control.

3.2 Neural network algorithm:

Fig 3: Diagram of a 4-layer neural network with two hidden layers

Conventional linear models are simply inadequate when it comes to modeling data that contains non-linear characteristics. The most common neural network model is known as a supervised network because it requires a desired output in order to learn. The objective of this network type is to create a model that maps the input to the output using historical data, so that the model can then be used to produce the output when the desired output is unknown. A graphical representation of a Multi-Layer Perceptron (MLP) [1, 3] is shown in Fig 3.
Step 1: Input a training vector.
Step 2: Hidden nodes calculate their outputs.
Step 3: Output nodes calculate their outputs on the basis of Step 2.
Step 4: Calculate the differences between the results of the previous step and the targets.
Step 5: Apply the first half of the training rule using the results of Step 4.
Step 6: For each hidden node n, calculate d(n).
Step 7: Apply the second half of the training rule using the results of Step 6.

Steps 1 through 3 are often called the forward process, and Steps 4 through 7 are often called the backward process; hence the name back-propagation.
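A minimal NumPy sketch of one such forward/backward pass for a single-hidden-layer network is given below, assuming a sigmoid activation and sum-squared error; the layer sizes, learning rate, and toy data are illustrative assumptions, not taken from the paper. Note that the hidden deltas d(n) of Step 6 are computed before the Step 5 weight update is applied, so that they use the pre-update weights.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes: 3 inputs, 4 hidden nodes, 2 outputs.
W1 = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 2))   # hidden -> output weights
lr = 0.1                                  # assumed learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target):
    """One forward/backward pass for a single training vector."""
    global W1, W2
    # Forward process (Steps 1-3).
    h = sigmoid(x @ W1)                     # Step 2: hidden node outputs
    y = sigmoid(h @ W2)                     # Step 3: output node outputs
    # Backward process (Steps 4-7).
    err = y - target                        # Step 4: difference from targets
    d_out = err * y * (1.0 - y)             # output-layer deltas
    d_hid = (d_out @ W2.T) * h * (1.0 - h)  # Step 6: d(n) for each hidden node
    W2 -= lr * np.outer(h, d_out)           # Step 5: update hidden->output weights
    W1 -= lr * np.outer(x, d_hid)           # Step 7: update input->hidden weights
    return 0.5 * np.sum(err ** 2)           # sum-squared error for monitoring

# Toy usage: learn to map one input vector to a target.
x = np.array([0.5, -0.2, 0.8])
t = np.array([1.0, 0.0])
for _ in range(100):
    loss = train_step(x, t)
print(loss)   # the error shrinks as the weights are adjusted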
3.3 Advantages of neural networks:
• Neural network classifiers, without any a priori assumptions about data distributions, are able to learn discontinuous patterns in the distribution of classes.
• Neural networks can readily accommodate auxiliary data such as textural information, slope, and elevation.
• Neural networks are quite flexible and can be adapted to improve performance on particular problems.
3.4 Applications of neural networks:

Application areas include system identification and control (vehicle control, process control, natural resources management), quantum chemistry, game playing and decision making (backgammon, chess, poker), pattern recognition in systems such as radar, face identification and object recognition, sequence recognition in gesture, speech and handwritten text recognition, medical diagnosis, financial applications such as automated trading systems, data mining, visualization and e-mail spam filtering. ANNs have also been used to diagnose several cancers. An artificial neural network based hybrid lung cancer detection system (HLND) improves the accuracy of diagnosis and the speed of lung cancer radiology. ANNs have also been used to diagnose prostate cancer: the diagnostic process builds models from a large group of patients and compares them against the information of a given patient, and these models do not depend on assumptions about correlations between different variables. Colorectal cancer has also been predicted using neural networks.
IV. Comparison of Classification Methods:

Table 1 gives a comparison of these classification techniques on various parameters.

Table 1: Comparison of classification methods

Classification     Generative or      Loss function       Parameter estimation
method             discriminative                         algorithm
-----------------  -----------------  ------------------  --------------------
Decision Tree      Discriminative     Zero-one loss       C4.5
Bayesian Network   Generative         log P(X, Y)         Variable Elimination
Neural Network     Discriminative     Sum-squared error   Forward Propagation
V. Conclusion

Classification methods in data mining are typically strong in modeling interactions. This paper covers two classification techniques widely used in data mining; each technique has its own pros and cons, as given in this paper. Decision trees and neural networks (NN) generally have different operational profiles: when one is very accurate the other often is not, and vice versa. By contrast, decision trees and rule classifiers have a similar operational profile. The goal of classification result integration algorithms is to generate more certain, precise and accurate system results.
References

[1] José C. Principe, Neil R. Euliano, Curt W. Lefebvre, "Neural and Adaptive Systems: Fundamentals Through Simulations", ISBN 0-471-35167-9.
[2] NeuroIntelligence, Alyuda Research, http://www.alyuda.com/neural-network-software.htm
[3] NeuroDimension Inc. web site, Neural Network Software, http://www.nd.com/
[4] Hopfield, J.J., "Neural Networks and Physical Systems with Emergent Collective Computational Abilities", Proceedings of the National Academy of Sciences, Vol. 79, 1982, pp. 2554-2558.
[5] Hopfield, J.J., "Neurons with Graded Response Have Collective Computational Properties Like Those of Two-State Neurons", Proceedings of the National Academy of Sciences, Vol. 81, 1984, pp. 3088-3092.
[6] Dong Xiao Ni, "Application of Neural Networks to Character Recognition", Proceedings of Students/Faculty Research Day, CSIS, Pace University, May 4, 2007.
[7] Eldon Y. Li, "Artificial neural networks and their business applications", Information & Management 27 (1994) 303-313.
[8] Thair Nu Phyu, "Survey of Classification Techniques in Data Mining", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, March 18-20, 2009, Hong Kong.
[9] Ms. Aparna Raj, Mrs. Bincy, Mrs. T. Mathu, "Survey on Common Data Mining Classification Techniques", International Journal of Wisdom Based Computing, Vol. 2(1), April 2012.