Contents
Preface
List of Figures
List of Tables
1. BASIC CONCEPTS IN DATA MINING
1.1 Introduction
1.2 Data Scales
1.2.1 Data vs Information
1.2.2 Data Types
1.3 Data Categories
1.3.1 Standard Scales of Measurement
1.3.2 Nominal Scale
Coding of Nominal Variables
Binary Variable
Coding of Binary Variables
Symmetric vs Asymmetric Binary Variables
Ternary Variables
1.3.3 Ordinal Scale
1.3.4 Allowed Operations
1.3.5 Interval Scale
Allowed Operations on Interval Data
Interval Data Transformations
1.3.6 Ratio Scale
Operations on Ratio Data
Nonstandard Data
Numeric Data Discretisation
Entropy Based Discretisation
1.4 Databases and Data Warehouses
1.4.1 Data Warehouses
1.5 Data Mining
1.6 Supervised and Unsupervised Learning
1.6.1 Steps in Data Mining
1.6.2 Data Mining Approaches
1.6.3 Data Mining Query Language (DMQL)
1.7 Some Applications
1.7.1 Banks
1.7.2 Communications
1.7.3 Government
1.7.4 Hospitals
1.7.5 Insurance
1.7.6 Sports
1.7.7 Miscellaneous
1.7.8 Summary
1.8 Exercises
References
2. DATA VISUALISATION TECHNIQUES
2.1 What is Data Visualisation?
2.1.1 Visualisation Categories
2.1.2 Tables
2.1.3 Graphics
2.2 One Variable Diagrams
2.2.1 Line Charts
2.2.2 Bar Charts
2.2.3 Histogram
Desirable Qualities of a Histogram
2.2.4 Pictogram
2.2.5 Time Charts
2.2.6 Temporal Histograms
2.2.7 Spatial Histograms
2.2.8 Pareto Diagrams
2.2.9 Pie-Charts
2.2.10 Radar Charts
2.2.11 Frequency Polygons and Frequency Curves
2.2.12 Stem-and-Leaf Plots
2.2.13 Overlay Charts
2.3 Multi-variable Diagrams
2.3.1 Scatterplot
2.3.2 Bubble Chart
2.3.3 Contour Plots
2.4 Hierarchical Charts
2.4.1 Polar Trees
2.4.2 Cause-and-Effect Diagrams
2.4.3 Q-Q Plots
2.4.4 Chernoff Plots
2.4.5 Box and Whisker Plots
2.4.6 Stem Plots
2.4.7 Miscellaneous Plots and Charts
2.4.8 Visualisation in Data Mining
2.5 Software for Data Visualisation
2.6 Exercises
References
3. PROBABILITY AND STATISTICS
3.1 Introduction
3.2 Probability
3.2.1
3.2.2
3.2.3
Different Ways to Express Probability
A Notation for Probability
Methods of Counting
Independence of Events
3.2.4 Rules of Probability
Probability Model
Entropy vs Probability
3.3 Venn Diagrams
3.3.1 De Morgan's Laws
3.4 Bayes Theorem
3.4.1 Bayes Theorem for Conditional Probability
Odds-Likelihood Ratio Form of Bayes Theorem
Product Rule for Conditional Probability
3.4.2 Bayes Classification Rule
Rule of Expected Utility
3.5 Mathematical Expectation
3.6 Statistics
3.6.1 Population vs Sample
3.6.2 Parameter vs Statistic
3.7 Measures of Location
3.7.1 Mean, Median and Mode
Weighted Mean
Advantages of Mean
3.7.2 Median
Advantages of Median
3.7.3 Mode
Advantages of Mode
3.7.4 Geometric Mean
3.7.5 Harmonic Mean
3.8 Measures of Dispersion
3.8.1 Range
3.8.2 Inter-Quartile Range
3.8.3 Mean Absolute Deviation
3.8.4 Variance
3.9 Outliers in Data
3.9.1 Spatial vs Temporal Outliers
3.9.2 Graphical Detection of Outliers
3.10 Data Transformations
3.10.1 Change of Origin
3.10.2 Change of Scale
3.10.3 Change of Origin and Scale
3.10.4 Min-max Transformation
3.10.5 Standard Normalisation
3.10.6 Nonlinear Transformations
3.11 Regression Basics
3.11.1 Scatterplots and Regression
Advantages of Scatter Plots
3.11.2 Simple Linear Regression
3.11.3 Ordinary Least Squares (OLS)
3.11.4 Weighted Least Squares (WLS)
3.11.5 Correlation Coefficient
3.11.6 From Scatterplot to Correlation
Interpretation of Correlation Coefficient
3.11.7 Multivariate Data
3.12 Multiple Linear Regression (MLR)
3.13 Monte Carlo Methods
Components of Monte Carlo Simulation
3.14 Contingency Tables
3.15 Exercises
References
4. DATAWAREHOUSING AND OLAP
4.1 The Datawarehouse
4.1.1 Goals of Data Warehousing
4.1.2 Advantages of Data Warehousing
4.1.3 Datawarehouses vs Databases
4.1.4 Operational Data Stores (ODS)
4.1.5 Metadata Catalogs
4.1.6 The Datawarehousing Team
4.1.7 Datawarehouse Architecture
4.1.8 Building a Datawarehouse
4.2 Data Marts
Advantages of Datamarts
4.3 ETL
4.3.1 ETL Tools
4.4 Data Staging
4.4.1 Data Extraction
4.4.2 Data Cleansing
4.4.3 Replacing Missing Values
4.4.4 Data Transformation
4.5 Spatial Datawarehouses (SDW)
4.6 Distributed Datawarehouses
Advantages of DDW
4.6.1 Virtual Data Warehouses (VDW)
4.6.2 Web-based Data Warehouses (WDW)
4.7 DW Indexing
4.8 Security in Datawarehousing
4.9 What is OLAP?
4.10 OLAP vs OLTP
4.10.1 Advantages of OLAP
4.11 Data Cubes and Cuboids
4.11.1 Dimensional Modeling
4.11.2 Concept Hierarchy
Fact Table
Additive Facts
Dimension Table
4.12 OLAP Schemas
4.12.1 Star Schema
4.12.2 Snowflake Schema
4.12.3 Fact Constellation Schema
4.13 OLAP Operations
4.13.1 Roll-up
4.13.2 Drill-Down
4.13.3 Slicing
4.13.4 Dicing
4.13.5 Pivoting
4.14 OLAP Security
4.15 OLAP Software
4.16 Exercises
References
5. DECISION TREES
5.1 Graph Theory
5.1.1 Drawing Graphs
5.1.2 Bipartite Graphs
Constructing Bipartite Graphs
5.2 Trees
Drawing Trees
5.3 Decision Trees
Chance and Terminal Nodes
5.3.1 Advantages of Decision Trees
5.3.2 Disadvantages of Decision Trees
5.3.3 Classification
5.3.4 Production Rules
5.4 Induction Algorithms
5.4.1 ID3 Algorithm
5.4.2 Building a DT
5.4.3 C4.5 Algorithm
5.5 Measures for Node Splitting
5.5.1 Gini's Index Measure
5.5.2 Shannon's Entropy Measure
5.5.3 Minimum Classification Error Measure
Gain and Impurity
5.5.4 CHi-squared Automatic Interaction Detector (CHAID)
5.5.5 Classification and Regression Tree (CART)
5.6
5.7
Pruning Decision Trees
Fuzzy Decision Trees
Decision Tables
5.8 Applications
Fraud Detection
5.9 Software for Decision Trees
5.10 Exercises
References
6. ASSOCIATION RULES
6.1 Association Rules
6.1.1 Antecedent and Consequent
6.2 Association Rule Measures
6.2.1 Confidence and Support
6.2.2 Cross-purchase Analysis
6.2.3 Categorical Variables
6.2.4 Sequence-purchase Analysis
6.3 Association Rule Mining
6.3.1 Activity Indicators
6.3.2 Computational Complexity of ARM
6.3.3 Sparse Association Rules
6.3.4 Rare Associations
6.4 Temporal Association Rules
6.4.1 Pareto Analysis
6.4.2 Paired Comparisons Analysis
6.4.3 Negative Associations
6.4.4 Fuzzy Association Rules
6.4.5 Plan Mining
6.5 Generalisations of Association Rules
6.6 Extended Association Rules
6.6.1 Multi-Level Association Rules (MLAR)
6.6.2 Multi-Dimensional Association Rules (MDAR)
6.6.3 Constrained Association Rules
6.6.4 Rule Constraints in Association Rule Mining
6.6.5 Weighted Association Rule Mining (WARM)
6.7 Algorithms for Association Rules
6.8 Applications
6.8.1 Purchase Domain Application
6.8.2 Diagnosis
6.8.3 Inventory Arrangement
6.8.4 Fraud Detection
6.9 Software for Association Rules
6.10 Exercises
References
7. CLUSTER ANALYSIS
7.1 Meaning of Clustering
7.1.1 Geometric Interpretation
Cluster Display
Cluster Formation
7.1.2 Cluster Analysis Step-by-Step
7.2 Similarity Metrics
7.2.1 Euclidean Distance Metric (L2 Metric)
7.2.2 Manhattan Metric (L1 Metric)
7.2.3 Minkowski Metric
7.2.4 Mahalanobis' Distance Metric
7.2.5 Chebychev Metric (L∞ Metric)
7.2.6 Other Metrics
7.3 Clustering Algorithms
7.3.1 Hierarchical Clustering Algorithms (HCA)
Agglomerative Algorithm
Divisive Algorithm
7.3.2 Partitioning Algorithms
K-means Clustering Algorithm
7.3.3 Density-based Methods
7.4 Cluster Validation Techniques (CVT)
7.5 Applications
7.5.1 Marketing
7.5.2 Insurance
7.5.3 Medical Sciences
7.5.4 Web Mining
7.5.5 Aviation
7.5.6 Miscellaneous Applications
7.6 Software for Clustering
7.7 Exercises
References
8. GENETIC ALGORITHMS
8.1 Introduction
8.1.1 Searching for Optimality
8.2 Genetic Learning Model
8.2.1 Advantages of Genetic Algorithms
8.2.2 Disadvantages
8.2.3 Steps in GA
8.2.4 A Notation for GA
8.3 Genetic Operators
8.3.1 Selection
Roulette Wheel Selection
Advantages of Roulette Wheel Selection
Disadvantages of Roulette Wheel Selection
Tournament Selection
8.3.2 Simple Crossover (SX)
8.3.3 Uniform Crossover (UX)
Advantages of Uniform Crossover
8.3.4 Multi-Crossover (MX)
8.3.5 Mutation
8.3.6 Inversion
8.3.7 Advanced Operators
8.3.8 Arithmetic Crossover (AX)
8.4 General Alphabet Set
8.5 Schema Theorem
8.5.1 Elitism
8.5.2 Epistasis
8.6 Implementation of GA
8.7 Parallel GA (PGA)
8.7.1 Multi-Stage GA
8.7.2 Neuro-Genetic Models (NGM)
8.8 Genetic Programming
8.9 Applications
Insurance
Fraud Detection
Miscellaneous Applications
8.10 Software for GA
8.11 Exercises
References
9. NEURAL NETWORKS
9.1 Introduction to Neural Networks
Neural Network Inspiration
9.1.1 Advantages of Neural Networks
9.2 Components of Neural Networks
9.2.1 Layering Concept
Data Transformation and Communication
Training Phase
Training Algorithms
9.2.2 Activation Functions
Sigmoid Function
Running Phase
Pruning Phase
9.3 Network Topologies
FFN vs FBN
9.3.1 Special Types of ANNs
Single Layer Perceptron (SLP)
Multi-Layer Perceptron (MLP)
Knowledge-based Networks
Kohonen Networks
Self Organising Map (SOM)
Fuzzy-Neural Networks (FNN)
Stochastic Neural Networks
Radial Basis Function (RBF) Networks
Probabilistic Neural Networks (PNN)
Hopfield Networks (HN)
Miscellaneous Types
9.3.2 Neural Networks vs MLR
9.3.3 Back-propagation Learning
Backpropagation Algorithm (BPA)
Ill-Conditioning
Implementation Issues
9.4 Applications
Advertising and Media Planning
Pattern Recognition
Classification
Data Compression
Speaker Identification
Web Mining
Biometrics
Miscellaneous Applications
9.5 Software for Neural Networks
9.6 Exercises
References
10. WEB MINING
10.1 Web Sites
10.1.1 Web Pages
10.1.2 Search Engines
10.1.3 Indexers
10.1.4 Information Extraction
10.1.5 Linguistic Search Engines
10.2 Web Mining
10.2.1 Advantages of Web Mining
10.2.2 Implementing Web Mining
10.3 Web Content Mining (WCM)
10.3.1 Web Usage Mining (WUM)
10.3.2 Web User Quality Mining
10.4 Web Structure Mining (WSM)
10.4.1 Link Mining
10.4.2 Measures for Web Structure Mining
10.4.3 Link Categorisation
10.4.4 Link Stepping
10.4.5 Links Analysis
10.4.6 Web Query Mining (WQM)
10.4.7 Query Performance Measures
F-score
10.5 Semantic Web Mining
10.5.1 Metadata Mining
10.5.2 Multilingual Web Mining
10.5.3 Web Personalisers
10.6 Text Mining
10.6.1 Text Mining Workflow
10.6.2 Pre-processing Text
10.6.3 Text Categorisation
10.6.4 Mining Textified Documents
10.6.5 Temporal Text Mining (TTM)
10.6.6 Distributed Text Mining (DTM)
10.6.7 Metrics for Text Mining
10.7 Image Mining
10.7.1 Issues in Image Mining
10.7.2 Multimedia Mining
10.7.3 Table Mining
10.7.4 Data Stream Mining (DSM)
10.8 Applications
Spam-Mail Classification
Web-page Clustering
Web Marketing
Miscellaneous Applications
10.9 Software for Web and Text Mining
10.10 Exercises
References
11. SUPPORT VECTOR MACHINES
11.1 Introduction
11.1.1 Structural Risk Minimisation Principle
11.1.2 Linear Separability
11.1.3 Solution Techniques
11.1.4 Hyperplane Classifiers
11.1.5 SVM Classifier
11.1.6 Overlapping Classes
11.1.7 Simple SVM (SSVM)
11.1.8 Lagrangian Formulation
11.1.9 Dual SVM Formulation
Properties of Dual SVM
11.2 Weighted SVM (W-SVM)
11.3 Multi-class SVM (MC-SVM)
11.3.1 Pair-wise SSVM (One-versus-One [OVO])
11.3.2 One-versus-All (OVA) SVM
11.4 Soft-Margin SVM (SM-SVM)
11.4.1 Weighted Soft Margin SVM (WSM-SVM)
11.4.2 ν-SVM
11.4.3 Pruning
11.5 Kernels
11.5.1 Properties of Kernels
11.5.2 Mercer's Theorem
11.6 Nonlinear SVM (NL-SVM)
11.6.1 Other Kernel Algorithms
11.7 Support Vector Regression (SVR)
11.8 Applications of SVM
11.8.1 Medical Application
11.8.2 Text Categorisation
11.9 SVM Software
11.10 Exercises
References
12. LATENT SEMANTIC INDEXING
12.1 Vector Space Models
12.1.1 Term-by-Document Matrix
12.1.2 Textual IR
12.1.3 Geometric Interpretation
12.2 Latent Semantic Analysis
12.2.1 Steps in LSA
12.2.2 Characteristics of LSA
12.2.3 Advantages of LSA
12.2.4 Disadvantages of LSA
12.3 Singular Value Decomposition
The SVD Algorithm
12.4 LSI Query
12.4.1 Query Processing
12.5 Applications of LSI
12.5.1 LSI in Information Retrieval
12.6 Software for LSI
12.7 Exercises
References
Appendix-A: The Backpropagation Algorithm
Solution to Selected Exercises
Index