Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble
Dr. Hongwei “Patrick” Yang
Educational Policy Studies & Evaluation
College of Education
University of Kentucky
Lexington, KY
Presented at the 2014 Modern Modeling Methods conference
Overview
• The study demonstrates predictive data mining models under model ensemble in the context of analyzing large data sets
• Data mining is usually defined as the data-driven process of discovering meaningful hidden patterns in large amounts of data through automatic as well as manual means
Overview
• Many industries use data mining to address business problems such as bankruptcy prediction, risk management, and fraud detection
• Such applications typically use predictive data mining models as learning machines, with a primary focus on making good predictions
Overview
• Among the many types of predictive data mining models are decision trees, neural networks, and (traditional) regression models:
  – Decision tree: identifies the most significant split of the outcome at each layer
  – Neural network: models nonlinear associations
• For each of these models/learning machines, the outcome can be either categorical or numerical (see the sketch below)
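The slides name the model types without showing code; as a concrete illustration, here is a minimal scikit-learn sketch (my own Python stand-in, not the study's SAS Enterprise Miner flow) that fits the three learners to a toy numerical outcome. Each regressor used here has a classifier counterpart for categorical outcomes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # toy predictors
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(size=1000)  # nonlinear numerical outcome

models = {
    "Tree": DecisionTreeRegressor(max_depth=6),     # recursive splits of the outcome
    "NN": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000),  # nonlinear associations
    "Reg": LinearRegression(),                      # traditional regression
}
for name, model in models.items():
    model.fit(X, y)
    ase = np.mean((model.predict(X) - y) ** 2)      # average squared error
    print(f"{name}: training ASE = {ase:.3f}")
```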
Overview
• Meanwhile, model ensemble techniques have recently become popular thanks to the growing power of computation
• Bagging and boosting are two of the most popular ensemble techniques (see the bagging sketch below)
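A hedged sketch of what bagging does (Breiman, 1996): each base model is fitted to a bootstrap resample of the training data, and the committee's predictions are averaged. The toy X, y from the earlier sketch are assumed. Boosting, by contrast, fits base models sequentially, reweighting the cases that earlier models predicted poorly (Freund & Schapire, 1996), and is not shown here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_new, n_models=50, seed=0):
    """Bagging: fit each base tree on a bootstrap resample, then average."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # sample rows with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)                    # pooled committee prediction
```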
Overview
• Model ensemble techniques are designed to create a model ensemble/committee containing multiple component/base models
• The committee of models is averaged or pooled in a certain manner to improve the stability and accuracy of predictions (see the sketch below)
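A minimal sketch of the averaging idea with a heterogeneous committee, assuming the toy X, y from the earlier sketch. I read the EnRegTreeNN rows in the tables below as this kind of regression + tree + neural network committee, though the study's exact pooling settings in SAS Enterprise Miner are not shown on the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

committee = [
    LinearRegression(),
    DecisionTreeRegressor(max_depth=6),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000),
]
for model in committee:
    model.fit(X, y)                                  # X, y as in the earlier sketch

# For a numerical outcome, pool by simple averaging; for a categorical one,
# posterior probabilities would be averaged instead.
ensemble_pred = np.mean([model.predict(X) for model in committee], axis=0)
```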
Overview
• Model ensemble techniques can be incorporated into many types of predictive models/learning machines (tree, neural network, regression, etc.)
• Ensemble-based modeling can also be combined with common feature/subset selection procedures (genetic algorithm, stepwise method, all-possible-subsets, etc.), as sketched below
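Of the selection procedures named above, the stepwise method is the easiest to sketch briefly; scikit-learn's SequentialFeatureSelector is a greedy stand-in for it (the genetic algorithm and all-possible-subsets variants are not shown).

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Greedy forward selection over the candidate predictors
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
)
selector.fit(X, y)                    # X, y as in the earlier sketches
X_selected = selector.transform(X)    # retain only the selected predictors
```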
Numerical examples
• To demonstrate the effectiveness of predictive data mining in discovering meaningful information from large data, the study chooses the three commonly used types of predictive models described above and analyzes them in two large-scale applications
Numerical examples
• To further improve the predictions from each type of model, model ensemble is implemented during the modeling process to pool the predictions from the individual component models
• For comparison purposes, all models are also fitted without creating any model ensemble
Numerical examples
• In addition, each model is evaluated for goodness of fit and performance at the final stage using various fit statistics, including average squared error, ROC index, misclassification rate, Gini coefficient, and the K-S statistic, as applicable (see the sketch below)
• The entire analysis is performed in SAS Enterprise Miner 7.1
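The slides report these statistics without formulas; here is a hedged sketch of the categorical-outcome versions, assuming numpy arrays y_true (observed 0/1 outcome) and p_hat (predicted event probabilities) exist.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

ase = np.mean((y_true - p_hat) ** 2)             # average squared error
misclass = np.mean((p_hat >= 0.5) != y_true)     # misclassification rate at cutoff 0.5
auc = roc_auc_score(y_true, p_hat)               # ROC index
gini = 2 * auc - 1                               # Gini coefficient

# K-S statistic: maximum gap between the score CDFs of events and non-events
thresholds = np.unique(p_hat)
cdf_event = np.array([(p_hat[y_true == 1] <= t).mean() for t in thresholds])
cdf_nonevent = np.array([(p_hat[y_true == 0] <= t).mean() for t in thresholds])
ks = np.abs(cdf_event - cdf_nonevent).max()
```

For the numerical outcome, the average squared error and its square root apply directly to the predictions themselves.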
Numerical examples
• Example one: Physicochemical properties of protein tertiary structure data
  – A numerical outcome: 45,730 cases
• Example two: Bank marketing data
  – A categorical outcome: 41,188 cases
• Both data sets are retrieved from the UC Irvine (UCI) Machine Learning Repository (see the loading sketch below)
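A sketch of retrieving the two data sets with pandas; the file paths under the repository root are my assumption about the UCI layout and are not given on the slides.

```python
import io
import urllib.request
import zipfile

import pandas as pd

BASE = "https://archive.ics.uci.edu/ml/machine-learning-databases"

# Example one: protein tertiary structure data (numerical outcome RMSD)
protein = pd.read_csv(f"{BASE}/00265/CASP.csv")

# Example two: bank marketing data, distributed as a zip (categorical outcome "y")
with urllib.request.urlopen(f"{BASE}/00222/bank-additional.zip") as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
bank = pd.read_csv(archive.open("bank-additional/bank-additional-full.csv"), sep=";")
```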
Example one: Numerical outcome
Table 1. Comparison of Models based on Training Data under a Numerical Outcome.

Model Description   Average Squared Error   Root Average Squared Error   Maximum Absolute Error
EnRegTreeNN         21.338                  4.619                        15.000
EnReg               22.874                  4.783                        14.818
EnNN                23.122                  4.809                        16.556
EnTree              25.193                  5.019                        16.131
NN                  23.591                  4.857                        19.663
Reg                 23.574                  4.855                        19.668
Tree                24.103                  4.910                        17.412
Example one: Numerical outcome
• Ensemble models tend to be more effective in reducing errors, although this is not guaranteed:
  – Average squared error: lower is better
  – Root average squared error: lower is better
  – Maximum absolute error: lower is better
Example two: Categorical outcome
Table 2. Comparison of Models based on Training Data under a Categorical Outcome.

Model Description   RASE    MR      ROC     Gini    K-S     BKS     Gain      Lift    CPCR
EnRegTreeNN         0.237   0.078   0.947   0.894   0.780   0.772   504.305   6.043   60.541
EnReg               0.241   0.081   0.935   0.871   0.719   0.717   455.744   5.557   55.676
EnNN                0.252   0.086   0.919   0.838   0.682   0.681   428.767   5.288   52.973
EnTree              0.270   0.101   0.801   0.602   0.579   0.576   395.325   4.953   49.623
Tree                0.254   0.090   0.900   0.800   0.697   0.692   441.595   5.416   54.179
NN                  0.261   0.098   0.912   0.823   0.675   0.670   400.087   5.001   50.027
Reg                 0.261   0.097   0.912   0.823   0.668   0.666   408.710   5.087   50.889

Note: RASE = root average squared error; MR = misclassification rate; ROC = ROC index; Gini = Gini coefficient; K-S = Kolmogorov-Smirnov statistic; BKS = bin-based two-way Kolmogorov-Smirnov statistic; Lift = cumulative lift; CPCR = cumulative percent captured response.
Example two: Categorical outcome
• Ensemble models typically have the best discriminatory power among all models, as indicated by each criterion (a sketch of the lift-based criteria follows this list):
  – Misclassification rate: lower is better
  – ROC index: higher is better
  – Gini coefficient: higher is better
  – K-S statistic: higher is better
  – Cumulative lift: higher is better
  – Cumulative percent captured response: higher is better
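Because cumulative lift and percent captured response are less standard than the other criteria, here is a hedged sketch of both, reusing y_true and p_hat from the earlier metrics sketch; the bin depth is my assumption, as the slides do not state which depth SAS Enterprise Miner reports.

```python
import numpy as np

def lift_at_depth(y_true, p_hat, depth=0.10):
    """Cumulative lift and percent captured response in the top `depth`
    fraction of cases ranked by predicted probability."""
    order = np.argsort(-p_hat)                         # highest scores first
    top = order[: int(len(y_true) * depth)]
    lift = y_true[top].mean() / y_true.mean()          # top-bin vs. overall event rate
    captured = 100 * y_true[top].sum() / y_true.sum()  # % of all events captured
    return lift, captured
```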
Conclusions
• The study presents some initial evidence for the effectiveness of model ensemble in improving the performance of an individual learning machine (model) of a given type
• The study needs to be supplemented with additional information on the use of (real) bagging and boosting in improving the performance of individual learning machines
Conclusions
• The study provides applied researchers with more options beyond traditional regression modeling when reliable predictions are needed in their research
• The study serves as the foundation for a future research topic that adds feature selection to predictive data mining modeling under model ensemble for analyzing very large data sets
References
Ao, S. (2008). Data mining and applications in genomics. Berlin, Heidelberg, Germany: Springer Science+Business Media.
Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Barutcuoglu, Z., & Alpaydin, E. (2003). A comparison of model aggregation methods for regression. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003 (pp. 76-83). New York, NY: Springer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Cerrito, P. B. (2006). Introduction to data mining: Using SAS Enterprise Miner. Cary, NC: SAS Institute Inc.
Drucker, H. (1997). Improving regressors using boosting techniques. Proceedings of the 14th International Conference on Machine Learning, 107-115.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121, 256-285.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, 148-156.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
Hill, C. M., Malone, L. C., & Trocine, L. (2004). Data mining and traditional regression. In H. Bozdogan (Ed.), Statistical data mining and knowledge discovery (pp. 233-249). London, UK: Chapman and Hall/CRC.
Larose, D. T. (2005). Discovering knowledge in data: An introduction to data mining. Hoboken, NJ: John Wiley & Sons, Inc.
Liu, B., Cui, Q., Jiang, T., & Ma, S. (2004). A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics, 5, 136.
Oza, N. C. (2005). Ensemble data mining methods. In J. Wang (Ed.), Encyclopedia of Data Warehousing and Mining (pp. 448-453). Hershey, PA: Information Science Reference.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.
Schapire, R. E. (2002). The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, & B. Yu (Eds.), MSRI workshop on nonlinear estimation and classification. New York, NY: Springer.