Theoretical Understanding of Estimation Tasks in Data Mining
Vinh Ngo, University of Houston – Clear Lake
Mike Ellis, University of Houston – Clear Lake
ABSTRACT
Linear regression is used in data mining applications to create simple,
understandable prediction models. More complex models require methods such as
regression trees, model trees, or neural networks. An important aspect of using these
models is to determine which one is appropriate for the current task. Part of making this
determination is the accuracy of each method’s estimation and how that accuracy will be
determined.
Keywords: data mining, estimation, regression, decision tree, regression tree,
neural network
INTRODUCTION
Managers have always looked for better ways to conduct their business
operations. Data mining has given them tools to help dig through years of computerized
data to find the groupings and trends that were previously impossible to discover. How
their customers, products, sales, and other business dimensions are associated can give
them insight into how the business runs. Spotting trends and predicting future data based
upon past experience can give them insight into what they should do going into the
future.
When we look to predict values that are not in predetermined categories, we use
estimation. Estimation, also known by the umbrella term “regression”, uses
mathematical techniques to make predictions when the set of possible values is
continuous, like future stock prices or the expected performance level of a computer
based upon its components. To effectively use these data mining tools we must have a
basic understanding of the techniques underlying them. In the case of estimation, the
three main techniques used are regression, decision trees, and neural networks.
REGRESSION
The simplest form of regression is linear regression. Simple linear
regression is used in a bivariate case (i.e., when only two variables are involved). It
models the response variable Y based upon the predictor variable X in the form of a
linear function
$$Y = \alpha + \beta X$$
where α and β are regression coefficients that specify the Y-intercept and the slope of the
regression line, respectively.
The regression coefficients can be calculated using the method of least squares.
Using this method fits the line to the actual data by minimizing the error between the two.
Given $s$ data points of the form $(x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s)$, and calculating $\bar{x}$
and $\bar{y}$ as the averages of the x and y values, β and α are calculated using the following
equations:

$$\beta = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2}$$

and

$$\alpha = \bar{y} - \beta \bar{x}. \quad [5]$$
One big advantage of linear regression is that it is a relatively easy technique to
use. It can easily be done in Excel, as regression is one of the “Data Analysis” functions.
An even simpler solution can be found using Excel’s charting capability, as the following
example illustrates.
Table 1 shows a set of data where X is a college graduate’s years of work
experience and Y is their salary. [5] The Excel chart below shows the X-Y scatter plot of
the two variables and the resulting regression equation.
Table 1 - Experience vs. Salary [5]

    Years of experience (X)    Salary ($1000s) (Y)
    3                          30
    8                          57
    9                          64
    13                         72
    3                          36
    6                          43
    11                         59
    21                         90
    1                          20
    16                         83

[Figure: “Experience and Salary” — X-Y scatter plot of the Table 1 data (years of experience vs. salary in $1000s) with the fitted regression line y = 3.5375x + 23.209.]
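To make the calculation concrete, the following sketch applies the least-squares equations above to the Table 1 data in plain Python (no spreadsheet needed); it recovers the same line shown on the chart.

```python
# Least-squares fit of the Table 1 data using the beta and alpha equations above.
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]        # years of experience (X)
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # salary in $1000s (Y)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar

print(f"y = {beta:.4f}x + {alpha:.3f}")  # y = 3.5375x + 23.209, matching the chart
```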
Of course, many problems cannot be described with a bivariate model. If more
than one predictor variable is used, then simple linear regression can be extended by
using multiple regression. With two predictor variables, X1 and X2, the multiple
regression model is
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2.$$
A polynomial relationship can also be modeled by the linear regression model.
By transforming the polynomial variables into new linear variables, the method of least
squares can be used to generate a regression equation. For example, an equation of the
form
$$Y = \alpha + \beta_1 X + \beta_2 X^2 + \beta_3 X^3$$
can be converted to linear form by defining the new variables
$X_1 = X$, $X_2 = X^2$, and $X_3 = X^3$.
The equation then becomes
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3,$$
which can be solved using the least squares method. [5]
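As a minimal sketch of this transformation (the cubic coefficients and noise level below are made up for illustration, not from the paper), the following Python/NumPy code builds the new variables X1, X2, X3 and solves the resulting multiple regression by least squares:

```python
import numpy as np

# Illustrative cubic data: y = 2 + 1.5x - 0.3x^2 + 0.05x^3 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 1.5 * x - 0.3 * x**2 + 0.05 * x**3 + rng.normal(0, 0.5, x.size)

# Transform the single predictor into the new linear variables X1=X, X2=X^2, X3=X^3.
# With a leading column of ones for alpha, this is an ordinary multiple regression.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Solve the least-squares problem for (alpha, beta1, beta2, beta3).
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print("alpha, beta1, beta2, beta3 =", np.round(coeffs, 3))
```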
Simple linear regression models are an attractive method to use because they are
so simple to set up and interpret. The predictions that come from them are often
surprisingly accurate. However, not every problem can be expressed as a straight line
model. More robust models are needed to reflect the complexity of many real-world
problems, e.g., financial time series prediction. But the linear regression algorithm
provides a solid foundation upon which to build more complex models.
DECISION TREES
While normally used for classifying discrete data, decision trees can also be used
to estimate continuous values. When the leaves of the tree contain the average values of
the training data points that traverse the tree to that leaf, the decision tree is referred to as
a regression tree. A model tree consists of leaves that contain linear regression models
to estimate the values of data points that reach the leaves. The method of growing
regression and model trees is the same. [7]
Continuous data is made usable in the tree structure by converting it to discrete
value ranges, or discretizing the data. The numeric data must be expressed in Boolean
terms to make discrete decisions possible so the same “divide and conquer” method of
growing the tree can be used as in the classification problem. Nominal values must be
chosen that partition the continuous data at what are called threshold values. These
threshold values allow the continuous data to be expressed in terms of values that are
“less than the threshold” (or greater than) and “all other values.” [4]
Threshold values are chosen to maximize information gain, and information gain
is based upon entropy. Entropy is a measure of the purity of a collection of data. If the
target attribute takes on c different values, then the entropy for a collection S is
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \quad [4]$$
where $p_i$ is the proportion of S belonging to class i. The information gain is then defined as
the expected reduction in entropy caused by partitioning the training data. If attribute A
is used to partition collection S, the gain is
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v).$$
The second term is the expected value of the entropy after collection S is partitioned
using the attribute A. It is the weighted sum of the entropy of the subsets. Therefore, the
gain reflects the entropy lost by partitioning. By choosing the partition attribute A that
maximizes information gain, we choose the best threshold value for partitioning the
continuous data.
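As a minimal sketch (the continuous attribute and binary class labels below are made up for illustration), the following Python code computes entropy and information gain, and picks the threshold that maximizes the gain:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_for_threshold(values, labels, t):
    """Information gain from splitting a continuous attribute at threshold t."""
    below = [l for v, l in zip(values, labels) if v < t]
    above = [l for v, l in zip(values, labels) if v >= t]
    weighted = (len(below) * entropy(below) + len(above) * entropy(above)) / len(labels)
    return entropy(labels) - weighted

# Illustrative training data: one continuous attribute and a binary class.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]

# Candidate thresholds are the midpoints between consecutive sorted values.
candidates = [(a + b) / 2 for a, b in zip(sorted(values), sorted(values)[1:])]
best = max(candidates, key=lambda t: gain_for_threshold(values, labels, t))
print(best, gain_for_threshold(values, labels, best))  # 3.5 gives gain 1.0 (a pure split)
```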
Once the partition points are developed, the tree’s growth proceeds according to a
predetermined growth algorithm. The Classification and Regression Tree (CART)
algorithm is the most popular method used for regression trees. CART methodology first
grows a regression tree that overfits the data. The algorithm then prunes the tree and
selects a subtree which is the best estimate of the target regression function. [3]
Using decision trees to estimate future values of continuous data does have
drawbacks. It is a technique originally developed for and best used with discrete data. It
can take large amounts of computation time to grow and prune the tree on training sets of
any significant size. But it also generates understandable rules to follow toward an expected value,
and provides an indication of which fields are most important for estimation. [3,4]
NEURAL NETWORKS
A more complex technique for data estimation is constructing a neural network.
Neural networks are composed of nodes, where calculations are carried out, and links,
which provide the connections between nodes. As their name implies, they are modeled
after the functioning of the human brain.
Nodes are found in one of three types of layers in the neural net model. The input
layer contains the nodes that accept the predictor variable values to be input to the model.
The output layer presents the results of the model to the user. Between these two layers
that are visible to the user are usually one or more hidden layers. The nodes in the hidden
layers perform intermediate calculations and communicate only with other nodes. Research
has shown that one hidden layer is usually sufficient, although for some problems multiple
hidden layers may be used. [2]

[Figure: A simple neural network [2] — input-layer nodes for Age, Gender, and Income feed a hidden layer, which feeds a single output node producing the prediction; numeric weights label the links.]
The neural network is built through an iterative process. Initial weights are
assigned to each of the nodes and the links between the nodes are defined. The input
nodes are presented with many values of the predictor variables from the training set and
these inputs are run through the network. The actual value from these same records is
known, and is compared to the estimated value. Through backpropagation the error is
passed back through the hidden layers and to the input layers. If the prediction at a node
is incorrect, then the nodes that had the most influence on making that decision have their
weights modified to reduce the chance of an error the next time. In this way the model
improves its accuracy incrementally.
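A minimal sketch of this training loop, assuming a single hidden layer with tanh activations and a synthetic data set (the sizes, learning rate, and target function below are illustrative, not from the paper):

```python
import numpy as np

# A one-hidden-layer network trained by backpropagation on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # e.g. age, gender, income (standardized)
y = (X @ np.array([0.5, -0.2, 0.8]))[:, None]  # a synthetic continuous target

W1 = rng.normal(scale=0.1, size=(3, 4))        # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(4, 1))        # hidden -> output weights
lr = 0.01

for epoch in range(500):
    # Forward pass: run the inputs through the network.
    h = np.tanh(X @ W1)          # hidden-layer activations
    pred = h @ W2                # linear output node (estimation task)

    # Compare the estimate to the known actual value.
    err = pred - y

    # Backpropagation: pass the error back through the layers and adjust
    # the weights that contributed most to it.
    grad_W2 = h.T @ err / len(X)
    grad_h = (err @ W2.T) * (1 - h**2)   # tanh'(z) = 1 - tanh(z)^2
    grad_W1 = X.T @ grad_h / len(X)
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

print("final MSE:", float(np.mean(err**2)))
```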
Backpropagation is the most common method used to make adjustments in the
node weights, but there are others. Recurrent networks connect the output back into the
model in the hidden layers. Genetic algorithms are also used to optimize weights. They
simulate natural evolution by allowing successful nodes to reproduce with slight
variations. Simulated annealing models its weight optimization on the annealing
process applied to metals: it makes large changes early in the
training process, then decreases the rate of change as it approaches a solution. [2]
Neural networks have both advantages and disadvantages for use in data mining.
They produce very accurate predictions, are relatively fast to use, and they handle
missing or corrupt data well. On the other hand, they are far less intuitive than other
models, they don’t handle large numbers of predictor variables well, and they require a
great deal of data preprocessing.
ACCURACY OF MODELS
With any kind of estimating or prediction model, the accuracy of the result is of
great importance. A model is of no use if its predictions are inaccurate. While each of
these methods is quite different in the way the estimation is arrived at, the accuracy of
each of the models can be determined in a similar manner.
One potential problem common to all of the data mining methods is overfitting.
Regression equations, trees, and neural nets may be developed that precisely conform to
the training set yet produce inaccurate results with real data. This can be combated by
using an independent test data set to check the model’s performance on data other than
that in the training set.
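A minimal sketch of such a holdout check on synthetic data (the split and the data are illustrative): fit on the training portion only, then compare training and test error.

```python
import numpy as np

# Fit a model on a training set, then check it on held-out test data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, 60)
y = 3.5 * X + 23 + rng.normal(0, 5, X.size)

train_X, test_X = X[:40], X[40:]   # hold out a third of the data
train_y, test_y = y[:40], y[40:]

# Fit on the training set only.
beta, alpha = np.polyfit(train_X, train_y, deg=1)

def mse(xs, ys):
    return float(np.mean((beta * xs + alpha - ys) ** 2))

# A large gap between the two errors would suggest overfitting.
print("training MSE:", mse(train_X, train_y))
print("test MSE:", mse(test_X, test_y))
```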
Errors and error rates can be discussed in the context of each of these methods,
but for determining model accuracy they are inadequate. We know there will be some
error. The problem is to determine the relative size of the error and whether it is
acceptable.
Several familiar statistical measures (shown in Table 2 below) are used for this
purpose. The most commonly used measure is the mean-squared error. The three terms
with the word “relative” in their names describe the error in terms relative to the error
between the actual values and the average of the actual values. The correlation
coefficient provides a statistical correlation between the actual values and the predicted
values. Which measure is most appropriate depends upon the situation within which it is
used. Fortunately, in most cases the best estimation method remains the best regardless
of the error measure used. [7]
CONCLUSION
Estimation of future values can be accomplished using any of the three techniques
discussed. Whether to use regression, decision trees, or neural networks should be
determined in the context of the data mining problem at hand. How accurate does the
estimate need to be? How complex is the data? How much time (and money) can we
justify spending to get an estimate? All of these questions must be answered to get the
proper perspective on the problem and decide which data mining estimation technique is
most appropriate for a specific application.
Table 2 - Statistical measures (a = actual values, p = predicted values)

mean-squared error: $\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{n}$

root mean-squared error: $\sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{n}}$

mean absolute error: $\frac{\sum_{i=1}^{n} |p_i - a_i|}{n}$

relative squared error: $\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}$, where $\bar{a} = \frac{1}{n}\sum_{i} a_i$

root relative squared error: $\sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}$

relative absolute error: $\frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}$

correlation coefficient: $\frac{S_{PA}}{S_P S_A}$, where $S_{PA} = \frac{\sum_{i} (p_i - \bar{p})(a_i - \bar{a})}{n - 1}$, $S_P = \sqrt{\frac{\sum_{i} (p_i - \bar{p})^2}{n - 1}}$, and $S_A = \sqrt{\frac{\sum_{i} (a_i - \bar{a})^2}{n - 1}}$
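A sketch of the Table 2 measures in Python/NumPy (the sample actual and predicted values are made up for illustration):

```python
import numpy as np

def error_measures(a, p):
    """Compute the Table 2 measures for actual values a and predictions p."""
    a, p = np.asarray(a, float), np.asarray(p, float)
    n = len(a)
    a_bar = a.mean()
    mse = np.sum((p - a) ** 2) / n
    rse = np.sum((p - a) ** 2) / np.sum((a - a_bar) ** 2)
    return {
        "mean-squared error": mse,
        "root mean-squared error": np.sqrt(mse),
        "mean absolute error": np.sum(np.abs(p - a)) / n,
        "relative squared error": rse,
        "root relative squared error": np.sqrt(rse),
        "relative absolute error": np.sum(np.abs(p - a)) / np.sum(np.abs(a - a_bar)),
        # Pearson r, equivalent to S_PA / (S_P * S_A) above.
        "correlation coefficient": float(np.corrcoef(a, p)[0, 1]),
    }

# Illustrative actual vs. predicted values.
print(error_measures([30, 57, 64, 72], [33, 55, 66, 70]))
```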
REFERENCES
1. BaseGroup Lab, “Decision trees -- general principles of operation”, online at
http://www.basegroup.ru/trees/description.en.htm.
2. Berson, Alex and Smith, Stephen J., Data Warehousing, Data Mining, & OLAP,
McGraw-Hill, 1997.
3. Boetticher, Gary, “Lecture Notes: Unit 3 – Machine Learners, Decision Trees”,
University of Houston-Clear Lake, 2006.
4. Gamberger, Dragan, et al., “DMS Tutorial – Decision Trees”, Laboratory for
Information Systems, Department of Electronics, Rudjer Boskovic Institute,
Croatia, online at http://dms.irb.hr/tutorial/tut_dtrees.php.
5. Han, Jiawei, and Kamber, Micheline, Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, San Francisco, CA, 2001.
6. Hand, David, et al., Principles of Data Mining, The MIT Press, Cambridge, MA,
2001.
7. Witten, Ian H., and Frank, Eibe, Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations, Morgan Kaufmann Publishers, San
Francisco, CA, 2000.