DATA MINING AND
KNOWLEDGE
DISCOVERY HANDBOOK
© 2008 AGI-Information Management Consultants
May be used for personal purposes only or by
libraries associated to dandelon.com network.
edited by
Oded Maimon and Lior Rokach
Tel-Aviv University, Israel
Springer
Contents
Dedication
Contributing Authors
Preface
Acknowledgments
1
Introduction to Knowledge Discovery in Databases
Oded Maimon and Lior Rokach
1.
The KDD Process
2.
Taxonomy of Data Mining Methods
3.
Data Mining within the Complete Decision Support System
4.
KDD & DM Research Opportunities and Challenges
5.
KDD & DM Trends
6.
The Organization of the Handbook
7.
References Principles
Part I Preprocessing Methods
2
Data Cleansing
Jonathan I. Maletic and Andrian Marcus
1.
INTRODUCTION
2.
DATA CLEANSING BACKGROUND
3.
GENERAL METHODS FOR DATA CLEANSING
4.
APPLYING DATA CLEANSING
5.
CONCLUSIONS
References
3
Handling Missing Attribute Values
Jerzy W. Grzymala-Busse and Witold J. Grzymala-Busse
1.
Introduction
2.
Sequential Methods
3.
Parallel Methods
4.
Conclusions
References
4
Geometric Methods for Feature Extraction and Dimensional Reduction
Christopher J.C. Burges
1.
Projective Methods
2.
Manifold Modeling
3.
Pulling the Threads Together
Acknowledgments
References
5
Dimension Reduction and Feature Selection
Barak Chizi and Oded Maimon
1.
Introduction
2.
Feature Selection Techniques
3.
Variable Selection
References
6
Discretization Methods
Ying Yang, Geoffrey I. Webb and Xindong Wu
1.
Terminology
2.
Taxonomy
3.
Typical methods
4.
Discretization and the learning context
5.
Summary
References
7
Outlier Detection
Irad Ben-Gal
1.
Introduction: Motivation, Definitions and Applications
2.
Taxonomy of Outlier Detection Methods
3.
Univariate Statistical Methods
4.
Multivariate Outlier Detection
5.
Comparison of Outlier Detection Methods
References
Part II Supervised Methods
8
Introduction to Supervised Methods
Oded Maimon and Lior Rokach
1.
Introduction
2.
Training Set
3.
Definition of the Classification Problem
4.
Induction Algorithms
5.
Performance Evaluation
6.
Scalability to Large Datasets
7.
The "Curse of Dimensionality"
8.
Classification Problem Extensions
References
9
Decision Trees
Lior Rokach and Oded Maimon
1.
Decision Trees
2.
Algorithmic Framework for Decision Trees
3.
Univariate Splitting Criteria
4.
Multivariate Splitting Criteria
5.
Stopping Criteria
6.
Pruning Methods
7.
Other Issues
8.
Decision Trees Inducers
9.
Advantages and Disadvantages of Decision Trees
10.
Decision Tree Extensions
References
10
Bayesian Networks
Paola Sebastiani, Maria M. Abad and Marco F. Ramoni
1.
Introduction
2.
Representation
3.
Reasoning
4.
Learning
5.
Bayesian Networks in Data Mining
6.
Data Mining Applications
7.
Conclusions and Future Research Directions
Acknowledgments
References
11
Data Mining within a Regression Framework
Richard A. Berk
1.
Introduction
2.
Some Definitions
3.
Regression Splines
4.
Smoothing Splines
5.
Locally Weighted Regression as a Smoother
6.
Smoothers for Multiple Predictors
7.
Recursive Partitioning
8.
Conclusions
Acknowledgments
References
12
Support Vector Machines
Armin Shmilovici
1.
Introduction
2.
Hyperplane Classifiers
3.
Non-Separable SVM Models
4.
Implementation Issues with SVM
5.
Extensions and Application
6.
Conclusion
References
13
Rule Induction
Jerzy W. Grzymala-Busse
1.
Introduction
2.
Types of Rules
3.
Rule Induction Algorithms
4.
Classification Systems
5.
Validation
6.
Advanced Methodology
References
Part III Unsupervised Methods
14
Visualization and Data Mining for High Dimensional Datasets
Alfred Inselberg
1.
Do it in Parallel!
2.
Visual Data Mining - A Case Study
3.
Visual and Computational Models
References
15
Clustering Methods
Lior Rokach and Oded Maimon
1.
Introduction
2.
Distance Measures
3.
Similarity Functions
4.
Evaluation Criteria Measures
5.
Clustering Methods
6.
Clustering Large Data Sets
7.
Determining the Number of Clusters
References
16
Association Rules
Frank Hoppner
1.
Introduction
2.
Association Rule Mining
3.
Application to Other Types of Data
4.
Extensions of the Basic Framework
5.
Conclusions
References
17
Frequent Set Mining
Bart Goethals
1.
Problem Description
2.
Apriori
3.
Eclat
4.
Optimizations
5.
Concise representations
6.
Theoretical Aspects
7.
Further Reading
References
18
Constraint-based Data Mining
Jean-Francois Boulicaut and Baptiste Jeudy
1.
Motivations
2.
Background and Notations
3.
Solving Anti-Monotonic Constraints
4.
Introducing non Anti-Monotonic Constraints
5.
Conclusion
References
19
Link Analysis
Steve Donoho
1.
Introduction
2.
Social Network Analysis
3.
Search Engines
4.
Viral Marketing
5.
Law Enforcement & Fraud Detection
6.
Combining with Traditional Methods
7.
Summary
References
Part IV Soft Computing Methods
20
Evolutionary Algorithms for Data Mining
Alex A. Freitas
1.
Introduction
2.
An Overview of Evolutionary Algorithms
3.
Evolutionary Algorithms for Discovering Classification Rules
4.
Evolutionary Algorithms for Clustering
5.
Evolutionary Algorithms for Data Preprocessing
6.
Multi-Objective Optimization with Evolutionary Algorithms
7.
Conclusions
References
21
Reinforcement-Learning: an Overview from a Data Mining Perspective
Shahar Cohen and Oded Maimon
1.
Introduction
2.
The Reinforcement-Learning Model
3.
Reinforcement-Learning Algorithms
4.
Extensions to Basic Model and Algorithms
5.
Applications of Reinforcement-Learning
6.
Reinforcement-Learning and Data-Mining
7.
An Instructive Example
References
22
Neural Networks
Peter G. Zhang
1.
Introduction
2.
A Brief History
3.
Neural Network Models
4.
Data Mining Applications
5.
Conclusions
References
23
On the use of Fuzzy Logic in Data Mining
Joseph Komem and Moti Schneider
1.
Introduction
2.
Fuzzy Sets and Fuzzy Logic
3.
Soft Regression
4.
Fuzzy Association Rules
5.
Conclusions
References
24
Granular Computing and Rough Sets
Tsau Young ('T. Y.') Lin and Churn-Jung Liau
1.
Introduction
2.
Naive Model for Problem Solving
3.
A Geometric Models of Information Granulations
4.
Information Granulations/Partitions
5.
Non-partition Application - Chinese Wall Security Policy Model
6.
Knowledge Representations
7.
Topological Concept Hierarchy Lattices/Trees
8.
Knowledge Processing
9.
Information Integration
10.
Conclusions
References
Part V Supporting Methods
25
Statistical Methods for Data Mining
Yoav Benjamini and Moshe Leshno
1.
Introduction
2.
Statistical Issues in DM
3.
Modeling Relationships using Regression Models
4.
False Discovery Rate (FDR) Control in Hypotheses Testing
5.
Model (Variables or Features) Selection using FDR Penalization in GLM
6.
Concluding Remarks
References
26
Logics for Data Mining
Petr Hájek
1.
Generalized quantifiers
2.
Some important classes of quantifiers
3.
Some comments and conclusion
Acknowledgments
References
27
Wavelet Methods in Data Mining
Tao Li, Sheng Ma and Mitsunori Ogihara
1.
Introduction
2.
A Framework for Data Mining Process
3.
Wavelet Background
4.
Data Management
5.
Preprocessing
6.
Core Mining Process
7.
Conclusion
References
28
Fractal Mining
Daniel Barbara and Ping Chen
1.
Introduction
2.
Fractal Dimension
3.
Clustering Using the Fractal Dimension
4.
Projected Fractal Clustering
5.
Tracking Clusters
6.
Conclusions
References
29
Interestingness Measures
Sigal Sahar
1.
Definitions and Notations
2.
Subjective Interestingness
3.
Objective Interestingness
4.
Impartial Interestingness
5.
Concluding Remarks
References
30
Quality Assessment Approaches in Data Mining
Maria Halkidi and Michalis Vazirgiannis
1.
Data Pre-processing and Quality Assessment
2.
Evaluation of Classification Methods
3.
Association Rules
4.
Cluster Validity
References
31
Data Mining Model Comparison
Paolo Giudici
1.
Data Mining and Statistics
2.
Data Mining Model Comparison
3.
Application to Credit Risk Management
4.
Conclusions
References
32
Data Mining Query Languages
Jean-Francois Boulicaut and Cyrille Masson
1.
The Need for Data Mining Query Languages
2.
Supporting Association Rule Mining Processes
3.
A Few Proposals for Association Rule Mining
4.
Conclusion
References
Part VI Advanced Methods
33
Meta-Learning
Ricardo Vilalta, Christophe Giraud-Carrier and Pavel Brazdil
1.
Introduction
2.
A Meta-Learning Architecture
3.
Techniques in Meta-Learning
4.
Tools and Applications
5.
Future Directions and Conclusions
References
34
Bias vs Variance Decomposition for Regression and Classification
Pierre Geurts
1.
Introduction
2.
Bias/Variance Decompositions
3.
Estimation of Bias and Variance
4.
Experiments and Applications
5.
Discussion
References
35
Mining with Rare Cases
Gary M. Weiss
1.
Introduction
2.
Why Rare Cases are Problematic
3.
Techniques for Handling Rare Cases
4.
Conclusion
References
36
Mining Data Streams
Haixun Wang, Philip S. Yu and Jiawei Han
1.
Introduction
2.
The Data Expiration Problem
3.
Classifier Ensemble for Drifting Concepts
4.
Experiments
5.
Discussion and Related Work
References
37
Mining High-Dimensional Data
Wei Wang and Jiong Yang
1.
Introduction
2.
Challenges
3.
Frequent Pattern
4.
Clustering
5.
Classification
References
38
Text Mining and Information Extraction
Moty Ben-Dov and Ronen Feldman
1.
Introduction
2.
Text Mining vs. Text Retrieval
3.
Task-Oriented Approaches vs. Formal Frameworks
4.
Task-Oriented Approaches
5.
Formal Frameworks And Algorithm-Based Techniques
6.
Hybrid Approaches - TEG
7.
Text Mining - Visualization and Analytics
References
39
Spatial Data Mining
Shashi Shekhar, Pusheng Zhang and Yan Huang
1.
Introduction
2.
Spatial Data
3.
Spatial Outliers
4.
Spatial Co-location Rules
5.
Predictive Models
6.
Spatial Clusters
7.
Summary
Acknowledgments
References
40
Data Mining for Imbalanced Datasets: An Overview
Nitesh V. Chawla
1.
Introduction
2.
Performance Measure
3.
Sampling Strategies
4.
Ensemble-based Methods
5.
Discussion
References
41
Relational Data Mining
Saso Dzeroski
1.
In a Nutshell
2.
Inductive logic programming
3.
Relational Association Rules
4.
Relational Decision Trees
5.
RDM Literature and Internet Resources
References
42
Web Mining
Johannes Fürnkranz
1.
Introduction
2.
Graph Properties of the Web
3.
Web Search
4.
Text Classification
5.
Hypertext Classification
6.
Information Extraction and Wrapper Induction
7.
The Semantic Web
8.
Web Usage Mining
9.
Collaborative Filtering
10.
Conclusion
References
43
A Review of Web Document Clustering Approaches
Nora Oikonomakou and Michalis Vazirgiannis
1.
Introduction
2.
Motivation for Document Clustering
3.
Web Document Clustering Approaches
4.
Comparison
5.
Conclusions and Open Issues
References
44
Causal Discovery
Hong Yao, Cory J. Butz, and Howard J. Hamilton
1.
Introduction
2.
Background Knowledge
3.
Theoretical Foundation
4.
Learning a DAG of CN by FDs
5.
Experimental Results
6.
Conclusion
References
45
Ensemble Methods For Classifiers
Lior Rokach
1.
Introduction
2.
Sequential Methodology
3.
Concurrent Methodology
4.
Combining Classifiers
5.
Ensemble Diversity
6.
Ensemble Size
7.
Cluster Ensemble
References
46
Decomposition Methodology for Knowledge Discovery and Data Mining
Oded Maimon and Lior Rokach
1.
Introduction
2.
Decomposition Advantages
3.
The Elementary Decomposition Methodology
4.
The Decomposer's Characteristics
5.
The Relation to Other Methodologies
6.
Summary
References
47
Information Fusion
Vicenc Torra
1.
Introduction
2.
Preprocessing Data
3.
Building Data Models
4.
Information Extraction
5.
Conclusions
Acknowledgments
References
48
Parallel And Grid-Based Data Mining
Antonio Congiusta, Domenico Talia and Paolo Trunfio
1.
Introduction
2.
Parallel Data Mining
3.
Grid-Based Data Mining
4.
The Knowledge Grid
5.
Summary
References
49
Collaborative Data Mining
Steve Moyle
1.
Introduction
2.
Remote Collaboration
3.
The Data Mining Process
4.
Collaborative Data Mining Guidelines
5.
Discussion
6.
Conclusions
References
50
Organizational Data Mining
Hamid R. Nemati and Christopher D. Barko
1.
Introduction
2.
Organizational Data Mining
3.
ODM versus Data Mining
4.
Ongoing ODM Research
5.
ODM Advantages
6.
ODM Evolution
7.
Summary
References
51
Mining Time Series Data
Chotirat Ann Ratanamahatana, Jessica Lin, Dimitrios Gunopulos, Eamonn Keogh,
Michail Vlachos and Gautam Das
1.
Introduction
2.
Time Series Similarity Measures
3.
Time Series Data Mining
4.
Time Series Representations
5.
Summary
References
Part VII Applications
52
Data Mining in Medicine
Nada Lavrac and Blaz Zupan
1.
Introduction
2.
Symbolic Classification Methods
3.
Subsymbolic Classification Methods
4.
Other Methods Supporting Medical Knowledge Discovery
5.
Conclusions
Acknowledgments
References
53
Learning Information Patterns in Biological Databases
Gautam B. Singh
1.
Background
2.
Learning Stochastic Pattern Models
3.
Searching for Meta-Patterns
4.
Conclusions
References
54
Data Mining for Selection of Manufacturing Processes
Bruno Agard and Andrew Kusiak
1.
Introduction
2.
Data Mining in Engineering
3.
Selection of Manufacturing Process with a Data Mining Approach
4.
Conclusion
References
55
Data Mining of Design Products and Processes
Yoram Reich
1.
Introduction
2.
Product Design Process
3.
Product Portfolio Management
4.
Conceptual Design
5.
Detailed Design
6.
Business and Manufacturing Process Planning
7.
Text Mining
8.
Observations and Future Advancements
9.
Epilogue
References
56
Data Mining in Telecommunications
Gary M. Weiss
1.
Introduction
2.
Types of Telecommunication Data
3.
Data Mining Applications
4.
Conclusion
References
57
Data Mining for Financial Applications
Boris Kovalerchuk and Evgenii Vityaev
1.
Introduction: Financial Tasks
2.
Specifics of Data Mining in Finance
3.
Aspects of Data Mining Methodology in Finance
4.
Data Mining Models and Practice in Finance
5.
Conclusion
References
58
Data Mining for Intrusion Detection
Anoop Singhal and Sushil Jajodia
1.
Introduction
2.
Data Mining Basics
3.
Data Mining Meets Intrusion Detection
4.
Conclusions and Future Research Directions
References
59
Data Mining For Software Testing
Mark Last
1.
Introduction
2.
Mining Software Metrics Databases
3.
Interaction-Pattern Discovery in System Usage Data
4.
Using Data Mining in Functional Testing
5.
Summary
Acknowledgments
References
60
Data Mining for CRM
Kurt Thearling
1.
What is CRM?
2.
Data Mining and Campaign Management
3.
An Example: Customer Acquisition
61
Data Mining for Target Marketing
Nissan Levin and Jacob Zahavi
1.
Introduction
2.
Modeling Process
3.
Evaluation Metrics
4.
Segmentation Methods
5.
Predictive Modeling
6.
In-Market Timing
7.
Pitfalls of Targeting
8.
Conclusions
References
Part VIII Software
62
Weka
Eibe Frank, Mark Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H. Witten and Len Trigg
1.
Introduction
References
63
Oracle Data Mining
P. Tamayo, C. Berger, M. Campos, J. Yarmus, B. Milenova, A. Mozes, M. Taft, M. Hornick, R. Krishnan, S. Thomas, M. Kelly, D. Mukhin, B. Haberstroh, S. Stephens and J. Myczkowski
1.
Introduction
2.
The Mining-in-the-Database Paradigm
3.
ODM Functionality and Algorithms
4.
Text and Spatial Mining
5.
ODM Examples
6.
Conclusions
References
64
Building Data Mining Solutions with OLE DB for DM and XML for Analysis
Zhaohui Tang, Jamie Maclennan and Pyungchul (Peter) Kim
1.
Introduction
2.
OLE DB for Data Mining
3.
Data Mining in SQL Server 2000
4.
Building Data Mining Application using OLE DB for Data Mining
5.
XML for Analysis
6.
Conclusion
References
65
LERS—A Data Mining System
Jerzy W. Grzymala-Busse
1.
Introduction
2.
Input Data
3.
Rule Sets
4.
Main Features
5.
Final Remarks
References
66
GainSmarts Data Mining System for Marketing
Nissan Levin and Jacob Zahavi
1.
Introduction
2.
Accessing GainSmarts
3.
Setting Up the Data for Modeling
4.
GainSmarts Modules
5.
Knowledge Evaluation
6.
Reporting
7.
Software Characteristics
8.
Applications
References
67
WizSoft's WizWhy
Abraham Meidan
1.
Introduction
2.
If-Then Rules
3.
If-and-Only-If Rules
4.
Data Summarization
5.
Interesting Phenomena
6.
Classifications
7.
Data Auditing
8.
WizWhy vs. other Data Mining Methods
References
68
DataEngine
Joseph Komem and Moti Schneider
1.
Overview
2.
Intelligent Technologies for Modeling and Control
3.
Work with DataEngine
References
Index