Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Applied Data Mining Basic Decision Trees in R Myles Gartland Rockhurst University May 22, 2017 Data Mining: Concepts and Techniques 1 To GUI or Not to GUI • CLI • IDE • GUI May 22, 2017 Data Mining: Concepts and Techniques 2 Classification vs. Prediction • Classification – predicts categorical class labels (discrete or nominal) – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction – models continuous-valued functions, i.e., predicts unknown or missing values • Typical applications – Credit approval – Target marketing – Medical diagnosis – Fraud detection May 22, 2017 Data Mining: Concepts and Techniques 3 Classification—A Two-Step Process • Model construction: describing a set of predetermined classes – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute – The set of tuples used for model construction is training set – The model is represented as classification rules, decision trees, or mathematical formulae • Model usage: for classifying future or unknown objects – Estimate accuracy of the model • The known label of test sample is compared with the classified result from the model • Accuracy rate is the percentage of test set samples that are correctly classified by the model • Test set is independent of training set, otherwise over-fitting will occur – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known May 22, 2017 Data Mining: Concepts and Techniques 4 Supervised vs. Unsupervised Learning • Supervised learning (classification) – Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations – New data is classified based on the training set • Unsupervised learning (clustering) – The class labels of training data is unknown – Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data May 22, 2017 Data Mining: Concepts and Techniques 5 Process (1): Model Construction Training Data NAME M ike M ary B ill Jim D ave A nne RANK YEARS TENURED A ssistant P rof 3 no A ssistant P rof 7 yes P rofessor 2 yes A ssociate P rof 7 yes A ssistant P rof 6 no A ssociate P rof 3 no May 22, 2017 Data Mining: Concepts and Techniques Classification Algorithms Classifier (Model) IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ 6 Process (2): Using the Model in Prediction Classifier Testing Data Unseen Data (Jeff, Professor, 4) NAME T om M erlisa G eorge Joseph May 22, 2017 RANK YEARS TENURED A ssistant P rof 2 no A ssociate P rof 7 no P rofessor 5 yes A ssistant P rof 7 yes Data Mining: Concepts and Techniques Tenured? 7 Decision Tree with rpart • Recursive partitioning and regression trees. • Non-trademarked version of CART • Rules based on variables' values are selected to get the best split to differentiate observations based on the dependent variable • Once a rule is selected and splits a node into two, the same process is applied to each "child" node (i.e. it is a recursive procedure) May 22, 2017 Data Mining: Concepts and Techniques 8 • Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and then the tree is later pruned.) • Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules. May 22, 2017 Data Mining: Concepts and Techniques 9 CP complexity parameter. • Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. For instance, with anova splitting, this means that the overall R-squared must increase by cp at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile. Essentially ,the user informs the program that any split which does not improve the fit by cp will likely be pruned off by cross-validation, and that hence the program need not pursue it. May 22, 2017 Data Mining: Concepts and Techniques 10 What is a Decision Tree May 22, 2017 Data Mining: Concepts and Techniques 11 The process • Start with the business question: “what are the characteristics of people who are likely to purchase a personal savings account” • Build the model – Create training and validation samples – Build a functional formula: SA ~ EV – Build the tree rpart(formula, data, control) – Test for accuracy – Build back into business rules May 22, 2017 Data Mining: Concepts and Techniques 12 Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) – Tree is constructed in a top-down recursive divide-and-conquer manner – At start, all the training examples are at the root – Attributes are categorical (if continuous-valued, they are discretized in advance) – Examples are partitioned recursively based on selected attributes – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning – All samples for a given node belong to the same class – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf – There are no samples left May 22, 2017 Data Mining: Concepts and Techniques 13 Random Forest • Random forests are an ensemble learning method for classification (and regression) that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees. May 22, 2017 Data Mining: Concepts and Techniques 14 Random Forest Limits • No missing data • Predictors cannot have more than 32 levels • rf.model <- randomForest(DV ~ ., data=, importance=TRUE, proximity=TRUE) May 22, 2017 Data Mining: Concepts and Techniques 15