Decision Tree Models in Data Mining
Matthew J. Liberatore
Thomas Coghlan
Decision Trees in Data Mining
- Decision trees can be used to predict a categorical or a continuous target (called regression trees in the latter case)
- Like logistic regression and neural networks, decision trees can be applied for classification and prediction
- Unlike these methods, no equations are estimated
- A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable
- The rules are of an IF-THEN form, for example:
  - If Risk = Low, then predict on-time payment of a loan
Decision Tree Approach
- A decision tree represents a hierarchical segmentation of the data
- The original segment is called the root node and is the entire data set
- The root node is partitioned into two or more segments by applying a series of simple rules over an input variable
  - For example, risk = low, risk = not low
  - Each rule assigns the observations to a segment based on its input value
- Each resulting segment can be further partitioned into sub-segments, and so on
  - For example, risk = low can be partitioned into income = low and income = not low
- The segments are also called nodes, and the final segments are called leaf nodes or leaves
Decision Tree Example – Loan Payment
Income
  < $30k: Age
    < 25: not on-time
    >= 25: on-time
  >= $30k: Credit Score
    < 600: not on-time
    >= 600: on-time
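
For readers who prefer code, the example tree can also be written as IF-THEN rules. The sketch below is a minimal Python rendering (not part of the original slides); the variable names income, age, and credit_score and the leaf assignments follow the figure as reconstructed above.

    def predict_on_time(income, age, credit_score):
        # Root split on income at $30k, then on age or credit score,
        # mirroring the example tree above (an assumed reading of the figure).
        if income < 30000:
            return "on-time" if age >= 25 else "not on-time"
        else:
            return "on-time" if credit_score >= 600 else "not on-time"

    print(predict_on_time(25000, 30, 580))   # -> "on-time"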
Growing the Decision Tree
- Growing the tree involves successively partitioning the data (i.e., recursive partitioning)
- If an input variable is binary, then the two categories can be used to split the data
- If an input variable is interval, a splitting value is used to classify the data into two segments
- For example, if household income is interval and there are 100 possible incomes in the data set, then there are 100 possible splitting values
  - For example, income < $30k and income >= $30k
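
As a rough illustration of how candidate splitting values arise for an interval input, the following Python sketch enumerates one candidate split per distinct income value; the income list is invented.

    # Invented list of household incomes at a node.
    incomes = [18000, 25000, 30000, 42000, 55000]

    # Each distinct income is a candidate splitting value,
    # e.g. income < 30000 vs. income >= 30000.
    for s in sorted(set(incomes)):
        left = [x for x in incomes if x < s]
        right = [x for x in incomes if x >= s]
        print(f"split at {s}: {len(left)} below, {len(right)} at or above")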
Evaluating the Partitions
- When the target is categorical, a chi-square statistic is computed for each partition of an input variable
- A contingency table is formed that maps responders and non-responders against the partitioned input variable
- For example, the null hypothesis might be that there is no difference between people with income < $30k and those with income >= $30k in making an on-time loan payment
  - The lower the p-value, the more likely we are to reject this hypothesis, meaning that this income split is a discriminating factor
Contingency Table

                          Income < $30k    Income >= $30k    Total
  Payment on-time
  Payment not on-time
  Total
Chi-Square Statistic
- The chi-square statistic measures how different the number of observations in each of the four cells is from the expected number
  - The p-value associated with the null hypothesis is computed
- Enterprise Miner then computes the logworth of the p-value: logworth = -log10(p-value)
- The split that generates the highest logworth for a given input variable is selected
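
A minimal sketch of this calculation in Python, using scipy's chi2_contingency in place of Enterprise Miner's internal computation; the 2x2 counts are invented for illustration.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: income < $30k and income >= $30k; columns: on-time and not on-time.
    table = np.array([[30, 20],
                      [45, 5]])
    chi2, p_value, dof, expected = chi2_contingency(table)
    logworth = -np.log10(p_value)   # a smaller p-value gives a larger logworth
    print(chi2, p_value, logworth)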
Growing the Tree
- In our loan payment example, we have three interval-valued input variables: income, age, and credit score
- We compute the logworth of the best split for each of these variables
- We then select the variable that has the highest logworth and use its split – suppose it is income
- Under each of the two income nodes, we then find the logworth of the best split of age and credit score and continue the process
  - subject to meeting the threshold on the significance of the chi-square value for splitting and other stopping criteria (described later)
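
The variable-selection step can be sketched as follows; the toy data, the best_logworth helper, and the use of scipy's chi2_contingency are illustrative assumptions, not Enterprise Miner's actual implementation.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Toy data: (income, age, credit_score, paid_on_time); all values are invented.
    rows = [(25000, 22, 580, 0), (28000, 30, 640, 1), (41000, 27, 590, 0),
            (52000, 35, 700, 1), (47000, 24, 720, 1), (29000, 40, 610, 1),
            (22000, 21, 550, 0), (60000, 45, 680, 1)]
    X = np.array(rows, dtype=float)
    y = X[:, 3]

    def best_logworth(col, y):
        """Logworth of the best binary split of one interval input."""
        best = -np.inf
        for v in np.unique(col)[1:]:          # candidate splitting values
            table = np.array([[np.sum((col < v) & (y == 0)), np.sum((col < v) & (y == 1))],
                              [np.sum((col >= v) & (y == 0)), np.sum((col >= v) & (y == 1))]])
            _, p, _, _ = chi2_contingency(table)
            best = max(best, -np.log10(p))
        return best

    for name, col in zip(["income", "age", "credit_score"], X[:, :3].T):
        print(name, round(best_logworth(col, y), 3))
    # The variable with the highest logworth is split first; the process then
    # repeats inside each child node, subject to the significance threshold.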
Other Splitting Criteria for a Categorical Target
- The gini and entropy measures are based on how heterogeneous the observations are at a given node
  - This relates to the mix of responders and non-responders at the node
- Let p1 and p0 represent the proportions of responders and non-responders at a node, respectively
- If two observations are chosen (with replacement) from a node, the probability that they are either both responders or both non-responders is p1^2 + p0^2
- The gini index = 1 - (p1^2 + p0^2), the probability that the two observations are different
  - The best case is a gini index of 0 (all observations are the same)
  - An index of 1/2 means both groups are equally represented
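
A minimal Python sketch of the gini index as defined above:

    def gini_index(p1):
        """Gini index for a two-class node; p1 is the proportion of responders."""
        p0 = 1.0 - p1
        return 1.0 - (p1 ** 2 + p0 ** 2)

    print(gini_index(1.0))   # 0.0 -> pure node (best case)
    print(gini_index(0.5))   # 0.5 -> both groups equally represented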
Other Splitting Criteria for a Categorical Target
- The rarity of an event is defined as -log2(p_i), where p_i is the proportion for class i (responder or non-responder)
- Entropy sums up the rarity of response and non-response over all observations
- Entropy ranges from the best case of 0 (all responders or all non-responders) to 1 (an equal mix of responders and non-responders)
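
A matching sketch of two-class entropy, using the same p1/p0 notation as the gini example:

    import math

    def entropy(p1):
        """Two-class entropy in bits; p1 is the proportion of responders."""
        p0 = 1.0 - p1
        return -sum(p * math.log2(p) for p in (p1, p0) if p > 0)

    print(entropy(1.0))   # 0.0 -> all responders (best case)
    print(entropy(0.5))   # 1.0 -> equal mix of responders and non-responders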
Splitting Criteria for a Continuous (Interval) Target
- An F-statistic is used to measure the degree of separation of a split for an interval target, such as revenue
- Similar to the sum-of-squares discussion under multiple regression, the F-statistic is based on the ratio of the sum of squares between the groups to the sum of squares within the groups, both adjusted for their degrees of freedom
- The null hypothesis is that there is no difference in the target mean between the two groups
- As before, the logworth of the p-value is computed
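
A small Python sketch of this test using scipy's f_oneway, which computes the same between-group vs. within-group sum-of-squares ratio for two groups; the revenue values are invented.

    import numpy as np
    from scipy.stats import f_oneway

    # Invented revenue values for the two segments produced by a candidate split.
    left = [12.0, 15.5, 14.2, 13.8, 16.1]
    right = [21.3, 19.8, 22.5, 20.7, 23.0]

    f_stat, p_value = f_oneway(left, right)   # between- vs. within-group variation
    logworth = -np.log10(p_value)
    print(f_stat, p_value, logworth)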
Some Adjustments
- The more possible splits of an input variable, the less accurate the p-value (a bigger chance of rejecting the null hypothesis)
  - If there are m splits, the Bonferroni adjustment adjusts the p-value of the best split by subtracting log10(m) from its logworth
  - If the Time of Kass Adjustment property is set to Before, then the p-values of the splits are compared with the Bonferroni adjustment applied
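
A small numeric sketch of the Bonferroni adjustment on the logworth scale; the p-value and the number of splits m are invented.

    import numpy as np

    p_value = 0.004          # p-value of the best split (invented)
    m = 100                  # number of candidate splits for this input (invented)
    logworth = -np.log10(p_value)                 # about 2.40
    adjusted_logworth = logworth - np.log10(m)    # subtract log10(m): about 0.40
    print(logworth, adjusted_logworth)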
Some Adjustments
- Setting the Split Adjustment property to Yes means that the significance of the p-value can be adjusted for the depth of the tree
  - For example, at the fourth split, a calculated p-value of 0.04 becomes 0.04 * 2^4 = 0.64, making the split statistically insignificant (see the sketch after this list)
  - This leads to rejecting more splits, limiting the size of the tree
- Tree growth can also be controlled by setting:
  - the Leaf Size property (minimum number of observations in a leaf)
  - the Split Size property (minimum number of observations required to allow a node to be split)
  - the Maximum Depth property (maximum number of generations of nodes)
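
The sketch below simply reproduces the slide's depth-adjustment arithmetic in Python; it is not Enterprise Miner's full adjustment logic.

    depth = 4                    # the split under consideration is the fourth one
    p_value = 0.04               # calculated p-value before adjustment
    adjusted_p = p_value * 2 ** depth      # 0.04 * 16 = 0.64
    print(adjusted_p, adjusted_p > 0.05)   # 0.64, True -> split rejected as insignificant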
Some Results
- The posterior probabilities are the proportions of responders and non-responders at each node
  - A node is classified as a responder or non-responder depending on which posterior probability is largest
- In selecting the best tree, one can use Misclassification, Lift, or Average Squared Error
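
A minimal sketch of how posterior probabilities determine a node's classification; the leaf counts are invented.

    # Invented counts of responders and non-responders in one leaf.
    n_responders, n_non_responders = 14, 6
    total = n_responders + n_non_responders

    posterior = {"responder": n_responders / total,           # 0.70
                 "non-responder": n_non_responders / total}   # 0.30
    label = max(posterior, key=posterior.get)   # classify by the larger posterior
    print(posterior, label)                     # -> responder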
Creating a Decision Tree Model in Enterprise Miner
- Open the bankrupt project and create a new diagram called Bankrupt_DecTree
- Drag and drop the bankrupt data node and the Decision Tree node (from the Model tab) onto the diagram
  - Connect the nodes
- Select ProbChisq for the Criterion under Splitting Rule
- Change Use Input Once to Yes (otherwise, the same variable can appear more than once in the tree)
- Under Subtree, select Misclassification for Assessment Measure
- Keep the defaults under P-Value Adjustment and Output Variables
- Under Score, set Variable Selection to No (otherwise, variables with importance values less than 0.05 are set to rejected and not passed on)
The Decision Tree has only one split, on RE/TA. The misclassification rate is 0.15 (3/20), with 2 false negatives and 1 false positive. The cumulative lift is somewhat lower than the best cumulative lift, starting out at 1.777 vs. the best value of 2.000.

Under Subtree, set Method to Largest and rerun. The results show that another split is added, using EBIT/TA. However, the misclassification rate is unchanged at 0.15. This result shows that setting Method to Assessment with Misclassification as the Assessment Measure finds the smallest tree having the lowest misclassification rate.
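
Outside Enterprise Miner, a roughly analogous tree can be sketched with scikit-learn. The file name bankrupt.csv, the column names, and the settings below are assumptions; an entropy criterion and a shallow depth merely stand in for the chi-square splitting and stopping rules described above, so the results will not necessarily match the numbers reported here.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix

    # Assumed layout: financial ratios such as RE/TA and EBIT/TA plus a 0/1 bankrupt flag.
    df = pd.read_csv("bankrupt.csv")
    X = df[["RE/TA", "EBIT/TA"]]
    y = df["bankrupt"]

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
    tree.fit(X, y)
    print(confusion_matrix(y, tree.predict(X)))   # rows: actual, columns: predicted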
Model Comparison
- The Model Comparison node under the Assess tab can be used to compare several different models
- Create a diagram called Full Model that includes the bankrupt data node connected into the regression, decision tree, and neural network nodes
- Connect the three model nodes into the Model Comparison node, and connect it and the bankrupt_score data node into a Score node
For Regression, set Selection Model to None; for Neural Network, set Model Selection Criterion to Average Error and the Network properties as before; for Decision Tree, set Assessment Measure to Average Squared Error and the other properties as before. This puts each of the models on a similar basis for fit. For Model Comparison, set Selection Criterion to Average Squared Error.
Neural Network is selected, although Regression is nearly identical in average squared error. The Receiver Operating Characteristic (ROC) curve shows sensitivity (true positives) vs. 1-specificity (false positives) for various cutoff probabilities of a response. The chart shows that no matter what the cutoff probability is, regression and neural network classify 100% of responders as responders (sensitivity) and 0% of non-responders as responders (1-specificity). Decision tree performs reasonably well, as indicated by the area above the diagonal line.
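
An ROC curve like the one described here can be reproduced outside Enterprise Miner with scikit-learn, assuming each model can produce a predicted probability of the response; the outcome and probability arrays below are placeholders.

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                  # placeholder outcomes
    p_hat = np.array([0.2, 0.4, 0.8, 0.7, 0.3, 0.9, 0.6, 0.5])   # placeholder predicted probabilities

    fpr, tpr, thresholds = roc_curve(y_true, p_hat)   # 1-specificity vs. sensitivity
    print(auc(fpr, tpr))   # area under the ROC curve; 0.5 corresponds to the diagonal line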