Theoretical Understanding of Estimation Tasks in Data Mining

Vinh Ngo, University of Houston – Clear Lake
Mike Ellis, University of Houston – Clear Lake

ABSTRACT

Linear regression is used in data mining applications to create simple, understandable prediction models. More complex models require methods such as regression trees, model trees, or neural networks. An important aspect of using these models is determining which one is appropriate for the current task. Part of making this determination is the accuracy of each method's estimation and how that accuracy will be measured.

Keywords: data mining, estimation, regression, decision tree, regression tree, neural network

INTRODUCTION

Managers have always looked for better ways to conduct their business operations. Data mining has given them tools to dig through years of computerized data to find the groupings and trends that were previously impossible to discover. Knowing how their customers, products, sales, and other business dimensions are associated gives them insight into how the business runs. Spotting trends and predicting future data based upon past experience gives them insight into what they should do going forward.

When we look to predict values that do not fall into predetermined categories, we use estimation. Estimation, also referred to under the umbrella term "regression," uses mathematical techniques to make predictions when the set of possible values is continuous, such as future stock prices or the expected performance level of a computer based upon its components. To use these data mining tools effectively, we must have a basic understanding of the techniques underlying them. In the case of estimation, the three main techniques used are regression, decision trees, and neural networks.

REGRESSION

The simplest form of regression is linear regression. Simple linear regression is used in the bivariate case (i.e., when only two variables are involved).
It models the response variable Y based upon the predictor variable X in the form of a linear function

    Y = α + βX

where α and β are regression coefficients that specify the Y-intercept and the slope of the regression line, respectively. The regression coefficients can be calculated using the method of least squares, which fits the line to the actual data by minimizing the error between the two. Given s data points of the form (x1, y1), (x2, y2), …, (xs, ys), and with x̄ and ȳ denoting the averages of the x and y values, β and α are calculated using the following equations:

    β = Σ_{i=1}^{s} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{s} (xi − x̄)²    and    α = ȳ − βx̄. [5]

One big advantage of linear regression is that it is a relatively easy technique to use. It can easily be done in Excel, where regression is one of the "Data Analysis" functions. An even simpler solution uses Excel's charting capability, as the following example illustrates. Table 1 shows a set of data where X is a college graduate's years of work experience and Y is their salary. [5] An Excel X-Y scatter plot of the two variables yields the regression equation y = 3.5375x + 23.209.

Table 1 - Experience vs. Salary [5]

    Years of experience (X)    Salary in $1000s (Y)
    3                          30
    8                          57
    9                          64
    13                         72
    3                          36
    6                          43
    11                         59
    21                         90
    1                          20
    16                         83

[Figure: X-Y scatter plot of experience vs. salary with fitted regression line y = 3.5375x + 23.209]

Of course, many problems cannot be described with a bivariate model. If more than one predictor variable is used, then simple linear regression can be extended to multiple regression. With two predictor variables, X1 and X2, the multiple regression model is

    Y = α + β1X1 + β2X2.

A polynomial relationship can also be handled by the linear regression model. By transforming the polynomial terms into new linear variables, the method of least squares can be used to generate a regression equation.
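The least-squares calculation for the Table 1 data can also be sketched in Python (an illustrative stand-in for the Excel analysis described above; not part of the original paper):

```python
# Least-squares fit for simple linear regression, applied to the
# experience-vs-salary data from Table 1. Pure Python, no libraries.

def least_squares(xs, ys):
    """Return (alpha, beta) for the fitted line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    alpha = y_bar - beta * x_bar
    return alpha, beta

years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]   # X values from Table 1
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]   # Y values from Table 1

alpha, beta = least_squares(years, salary)
print(f"y = {beta:.4f}x + {alpha:.3f}")
```

The recovered coefficients match the trend line reported in the Excel chart, y = 3.5375x + 23.209.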
For example, an equation of the form

    Y = α + β1X + β2X² + β3X³

can be converted to linear form by defining the new variables X1 = X, X2 = X², and X3 = X³. The equation then becomes

    Y = α + β1X1 + β2X2 + β3X3,

which can be solved using the least squares method. [5]

Simple linear regression models are attractive because they are so simple to set up and interpret, and the predictions that come from them are often surprisingly accurate. However, not every problem can be expressed as a straight-line model. More robust models are needed to reflect the complexity of many real-world problems, e.g., financial time series prediction. But the linear regression algorithm provides a solid foundation upon which to build more complex models.

DECISION TREES

While normally used for classifying discrete data, decision trees can also be used to estimate continuous values. When the leaves of the tree contain the average values of the training data points that reach that leaf, the decision tree is referred to as a regression tree. A model tree's leaves instead contain linear regression models that estimate the values of the data points reaching them. The method of growing regression and model trees is the same. [7]

Continuous data is made viable in the tree structure by converting it to discrete value ranges, i.e., by discretizing the data. The numeric data must be expressed in Boolean terms so that discrete decisions are possible and the same "divide and conquer" method of growing the tree can be used as in the classification problem. Nominal values must be chosen that partition the continuous data at what are called threshold values. These threshold values allow the continuous data to be expressed in terms of values that are "less than the threshold" (or greater than it) and "all other values." [4]

Threshold values are chosen to maximize information gain, and information gain is based upon entropy. Entropy is a measure of the purity of a collection of data.
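As a concrete sketch (with made-up toy data, not from the paper), entropy and the information gain of a candidate threshold split can be computed in Python:

```python
import math

# Entropy of a collection of class labels, and the information gain
# of splitting a continuous attribute at a given threshold.

def entropy(labels):
    n = len(labels)
    proportions = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in proportions)

def gain(values, labels, threshold):
    """Entropy of the whole collection minus the weighted entropy
    of the two subsets produced by the threshold split."""
    left  = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# Toy data: a threshold of 5 separates the two classes perfectly.
values = [1, 2, 3, 10, 11, 12]
labels = ["low", "low", "low", "high", "high", "high"]
print(gain(values, labels, 5))
```

Here the threshold 5 yields two pure subsets, so the split recovers the full one bit of entropy in the collection; a worse threshold would yield a smaller gain.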
If the target attribute takes on c different values, then the entropy of a collection S is

    Entropy(S) = −Σ_{i=1}^{c} pi log2 pi   [4]

where pi is the proportion of S belonging to class i. The information gain is then defined as the expected reduction in entropy caused by partitioning the training data. If attribute A is used to partition collection S, the gain is

    Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv).

The second term is the expected value of the entropy after collection S is partitioned using attribute A: the weighted sum of the entropies of the subsets. The gain therefore reflects the entropy lost by partitioning. By choosing the partition attribute A that maximizes information gain, we choose the best threshold value for partitioning the continuous data.

Once the partition points are chosen, the tree's growth proceeds according to a predetermined growth algorithm. The Classification and Regression Tree (CART) algorithm is the most popular method used for regression trees. CART first grows a regression tree that overfits the data, then prunes the tree and selects the subtree that best estimates the target regression function. [3]

Using decision trees to estimate future values of continuous data does have drawbacks. The technique was originally developed for, and is best used with, discrete data, and growing and pruning the tree can take large amounts of computation time on training sets of any size. On the other hand, a decision tree generates understandable rules to follow toward an expected value, and it provides an indication of which fields are most important for estimation. [3,4]

NEURAL NETWORKS

A more complex technique for data estimation is constructing a neural network. Neural networks are composed of nodes, where calculations are carried out, and links, which provide the connections between nodes. As their name implies, they are modeled after the functioning of the human brain.
Nodes are found in one of three types of layers in the neural net model. The input layer contains the nodes that accept the predictor variable values to be input to the model. The output layer presents the results of the model to the user. Between these two layers that are visible to the user are usually one or more hidden layers. The nodes in the hidden layers perform intermediate calculations and communicate only with other nodes. Research has shown that one hidden layer is usually sufficient, although for some problems multiple hidden layers may be used. [2]

[Figure: A simple neural network [2] - input layer nodes (Age, Gender, Income) feed a weighted hidden layer, which feeds an output layer that produces the prediction]

The neural network is built through a recursive process. Initial weights are assigned to each of the nodes and the links between the nodes are defined. The input nodes are presented with many values of the predictor variables from the training set, and these inputs are run through the network. The actual value from these same records is known and is compared to the estimated value. Through backpropagation, the error is passed back through the hidden layers to the input layer. If the prediction at a node is incorrect, then the nodes that had the most influence on that decision have their weights modified to reduce the chance of an error the next time. In this way the model improves its accuracy incrementally.

Backpropagation is the most common method used to adjust the node weights, but there are others. Recurrent networks connect the output back into the model's hidden layers. Genetic algorithms are also used to optimize weights; they simulate natural evolution by allowing successful nodes to reproduce with slight variations. Simulated annealing uses the annealing process as applied to metals as a model for weight optimization.
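The backpropagation training cycle described above can be sketched for a tiny one-hidden-layer network. This is a toy illustration: the layer sizes, tanh activation, learning rate, and target function are all assumptions, not details from the paper.

```python
import numpy as np

# Toy backpropagation: train a one-hidden-layer network to estimate y = 2x.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))   # predictor values (training set)
y = 2.0 * X                                  # known actual values

W1 = rng.normal(0.0, 0.5, size=(1, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0.0, 0.5, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 0.1                                                   # learning rate

def predict(X):
    hidden = np.tanh(X @ W1 + b1)       # hidden-layer calculations
    return hidden, hidden @ W2 + b2     # linear output node

mse_before = float(np.mean((predict(X)[1] - y) ** 2))

for _ in range(500):
    hidden, pred = predict(X)
    err = pred - y                       # estimated vs. actual value
    # Backpropagation: pass the error back through the layers and adjust
    # each weight in proportion to its influence on the mistake.
    grad_W2 = hidden.T @ err / len(X)
    grad_b2 = err.mean(axis=0)
    d_hidden = (err @ W2.T) * (1.0 - hidden ** 2)   # tanh derivative
    grad_W1 = X.T @ d_hidden / len(X)
    grad_b1 = d_hidden.mean(axis=0)
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

mse_after = float(np.mean((predict(X)[1] - y) ** 2))
print(mse_before, "->", mse_after)
```

Each pass compares the estimated values to the known actual values and nudges the weights against the error gradient, so the mean-squared error shrinks incrementally over the training iterations.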
Annealing makes large changes early in the training process, then decreases the rate of change as it approaches a solution. [2]

Neural networks have both advantages and disadvantages for use in data mining. They produce very accurate predictions, are relatively fast to use, and handle missing or corrupt data well. On the other hand, they are far less intuitive than other models, they do not handle large numbers of predictor variables well, and they require a great deal of data preprocessing.

ACCURACY OF MODELS

With any kind of estimating or prediction model, the accuracy of the result is of great importance. A model is of no use if its predictions are inaccurate. While each of these methods arrives at its estimate quite differently, the accuracy of each model can be determined in a similar manner.

One potential problem common to all of the data mining methods is overfitting. Regression equations, trees, and neural nets may be developed that precisely conform to the training set yet produce inaccurate results on real data. This can be combated by using an independent test data set to check the model's performance on data other than the training set.

Errors and error rates can be discussed in the context of each of these methods, but by themselves they are inadequate for determining model accuracy. We know there will be some error; the problem is to determine the relative size of the error and whether it is acceptable. Several familiar statistical measures (shown in Table 2) are used for this purpose. The most commonly used measure is the mean-squared error. The three measures with the word "relative" in their names describe the error relative to the error between the actual values and the average of the actual values. The correlation coefficient provides a statistical correlation between the actual values and the predicted values. Which measure is most appropriate depends upon the situation in which it is used.
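The measures collected in Table 2 can be sketched in Python for hypothetical lists of actual and predicted values (an illustration; the data below is made up):

```python
import math

# The statistical error measures of Table 2, for predicted values p
# and actual values a.

def error_measures(p, a):
    n = len(a)
    a_bar = sum(a) / n
    p_bar = sum(p) / n
    mse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / n
    mae = sum(abs(pi - ai) for pi, ai in zip(p, a)) / n
    # "Relative" measures compare the error to the error between the
    # actual values and the average of the actual values.
    rse = sum((pi - ai) ** 2 for pi, ai in zip(p, a)) \
        / sum((ai - a_bar) ** 2 for ai in a)
    rae = sum(abs(pi - ai) for pi, ai in zip(p, a)) \
        / sum(abs(ai - a_bar) for ai in a)
    s_pa = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a)) / (n - 1)
    s_p = math.sqrt(sum((pi - p_bar) ** 2 for pi in p) / (n - 1))
    s_a = math.sqrt(sum((ai - a_bar) ** 2 for ai in a) / (n - 1))
    return {
        "mean-squared error": mse,
        "root mean-squared error": math.sqrt(mse),
        "mean absolute error": mae,
        "relative squared error": rse,
        "root relative squared error": math.sqrt(rse),
        "relative absolute error": rae,
        "correlation coefficient": s_pa / (s_p * s_a),
    }

actual    = [30, 57, 64, 72, 36]
predicted = [33, 54, 65, 70, 38]
for name, value in error_measures(predicted, actual).items():
    print(f"{name}: {value:.4f}")
```

A model that tracks the actual values closely yields relative errors near zero and a correlation coefficient near one, whichever measure is chosen.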
Fortunately, in most cases the best estimation method remains the best regardless of the error measure used. [7]

CONCLUSION

Estimation of future values can be accomplished using any of the three techniques discussed. Whether to use regression, decision trees, or neural networks should be determined in the context of the data mining problem at hand. How accurate does the estimate need to be? How complex is the data? How much time (and money) can we justify spending to get an estimate? All of these questions must be answered to get the proper perspective on the problem and to decide which data mining estimation technique is most appropriate for a specific application.

Table 2 - Statistical measures (a = actual values, p = predicted values)

    mean-squared error:           Σ_{i=1}^{n} (pi − ai)² / n
    root mean-squared error:      √( Σ_{i=1}^{n} (pi − ai)² / n )
    mean absolute error:          Σ_{i=1}^{n} |pi − ai| / n
    relative squared error:       Σ_{i=1}^{n} (pi − ai)² / Σ_{i=1}^{n} (ai − ā)², where ā = Σ_i ai / n
    root relative squared error:  √( Σ_{i=1}^{n} (pi − ai)² / Σ_{i=1}^{n} (ai − ā)² )
    relative absolute error:      Σ_{i=1}^{n} |pi − ai| / Σ_{i=1}^{n} |ai − ā|
    correlation coefficient:      S_PA / (S_P · S_A), where
                                  S_PA = Σ_i (pi − p̄)(ai − ā) / (n − 1),
                                  S_P = √( Σ_i (pi − p̄)² / (n − 1) ), and
                                  S_A = √( Σ_i (ai − ā)² / (n − 1) )

REFERENCES

1. BaseGroup Lab, "Decision trees -- general principles of operation", online at http://www.basegroup.ru/trees/description.en.htm.
2. Berson, Alex, and Smith, Stephen J., Data Warehousing, Data Mining, & OLAP, McGraw-Hill, 1997.
3. Boetticher, Gary, "Lecture Notes: Unit 3 – Machine Learners, Decision Trees", University of Houston-Clear Lake, 2006.
4. Gamberger, Dragan, et al., "DMS Tutorial – Decision Trees", Laboratory for Information Systems, Department of Electronics, Rudjer Boskovic Institute, Croatia, online at http://dms.irb.hr/tutorial/tut_dtrees.php.
5. Han, Jiawei, and Kamber, Micheline, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 2001.
6. Hand, David, et al., Principles of Data Mining, The MIT Press, Cambridge, MA, 2001.
7.
Witten, Ian H., and Frank, Eibe, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, San Francisco, CA, 2000.