Contents

Preface
List of Figures
List of Tables

1. BASIC CONCEPTS IN DATA MINING
  1.1 Introduction
  1.2 Data Scales
    1.2.1 Data vs Information
    1.2.2 Data Types
  1.3 Data Categories
    1.3.1 Standard Scales of Measurement
    1.3.2 Nominal Scale
      Coding of Nominal Variables
      Binary Variable
      Coding of Binary Variables
      Symmetric vs Asymmetric Binary Variables
      Ternary Variables
    1.3.3 Ordinal Scale
    1.3.4 Allowed Operations
    1.3.5 Interval Scale
      Allowed Operations on Interval Data
      Interval Data Transformations
    1.3.6 Ratio Scale
      Operations on Ratio Data
      Nonstandard Data
      Numeric Data Discretisation
      Entropy Based Discretisation
  1.4 Databases and Data Warehouses
    1.4.1 Data Warehouses
  1.5 Data Mining
  1.6 Supervised and Unsupervised Learning
    1.6.1 Steps in Data Mining
    1.6.2 Data Mining Approaches
    1.6.3 Data Mining Query Language (DMQL)
  1.7 Some Applications
    1.7.1 Banks
    1.7.2 Communications
    1.7.3 Government
    1.7.4 Hospitals
    1.7.5 Insurance
    1.7.6 Sports
    1.7.7 Miscellaneous
    1.7.8 Summary
  1.8 Exercises
  References

2. DATA VISUALISATION TECHNIQUES
  2.1 What is Data Visualisation?
    2.1.1 Visualisation Categories
    2.1.2 Tables
    2.1.3 Graphics
  2.2 One Variable Diagrams
    2.2.1 Line Charts
    2.2.2 Bar Charts
    2.2.3 Histogram
      Desirable Qualities of a Histogram
    2.2.4 Pictogram
    2.2.5 Time Charts
    2.2.6 Temporal Histograms
    2.2.7 Spatial Histograms
    2.2.8 Pareto Diagrams
    2.2.9 Pie-Charts
    2.2.10 Radar Charts
    2.2.11 Frequency Polygons and Frequency Curves
    2.2.12 Stem-and-Leaf Plots
    2.2.13 Overlay Charts
  2.3 Multi-variable Diagrams
    2.3.1 Scatterplot
    2.3.2 Bubble Chart
    2.3.3 Contour Plots
  2.4 Hierarchical Charts
    2.4.1 Polar Trees
    2.4.2 Cause-and-Effect Diagrams
    2.4.3 Q-Q Plots
    2.4.4 Chernoff Plots
    2.4.5 Box and Whisker Plots
    2.4.6 Stem Plots
    2.4.7 Miscellaneous Plots and Charts
    2.4.8 Visualisation in Data Mining
  2.5 Software for Data Visualisation
  2.6 Exercises
  References

3. PROBABILITY AND STATISTICS
  3.1 Introduction
  3.2 Probability
    3.2.1 Different Ways to Express Probability
    3.2.2 A Notation for Probability
    3.2.3 Methods of Counting
      Independence of Events
    3.2.4 Rules of Probability
      Probability Model
      Entropy vs Probability
  3.3 Venn Diagrams
    3.3.1 De Morgan's Laws
  3.4 Bayes Theorem
    3.4.1 Bayes Theorem for Conditional Probability
      Odds-Likelihood Ratio Form of Bayes Theorem
      Product Rule for Conditional Probability
    3.4.2 Bayes Classification Rule
      Rule of Expected Utility
  3.5 Mathematical Expectation
  3.6 Statistics
    3.6.1 Population vs Sample
    3.6.2 Parameter vs Statistic
  3.7 Measures of Location
    3.7.1 Mean, Median and Mode
      Weighted Mean
      Advantages of Mean
    3.7.2 Median
      Advantages of Median
    3.7.3 Mode
      Advantages of Mode
    3.7.4 Geometric Mean
    3.7.5 Harmonic Mean
  3.8 Measures of Dispersion
    3.8.1 Range
    3.8.2 Inter-Quartile Range
    3.8.3 Mean Absolute Deviation
    3.8.4 Variance
  3.9 Outliers in Data
    3.9.1 Spatial vs Temporal Outliers
    3.9.2 Graphical Detection of Outliers
  3.10 Data Transformations
    3.10.1 Change of Origin
    3.10.2 Change of Scale
    3.10.3 Change of Origin and Scale
    3.10.4 Min-max Transformation
    3.10.5 Standard Normalisation
    3.10.6 Nonlinear Transformations
  3.11 Regression Basics
    3.11.1 Scatterplots and Regression
      Advantages of Scatter Plots
    3.11.2 Simple Linear Regression
    3.11.3 Ordinary Least Squares (OLS)
    3.11.4 Weighted Least Squares (WLS)
    3.11.5 Correlation Coefficient
    3.11.6 From Scatterplot to Correlation
      Interpretation of Correlation Coefficient
    3.11.7 Multivariate Data
  3.12 Multiple Linear Regression (MLR)
  3.13 Monte Carlo Methods
    Components of Monte Carlo Simulation
  3.14 Contingency Tables
  3.15 Exercises
  References

4. DATAWAREHOUSING AND OLAP
  4.1 The Datawarehouse
    4.1.1 Goals of Data Warehousing
    4.1.2 Advantages of Data Warehousing
    4.1.3 Datawarehouses vs Databases
    4.1.4 Operational Data Stores (ODS)
    4.1.5 Metadata Catalogs
    4.1.6 The Datawarehousing Team
    4.1.7 Datawarehouse Architecture
    4.1.8 Building a Datawarehouse
  4.2 Data Marts
    Advantages of Datamarts
  4.3 ETL
    4.3.1 ETL Tools
  4.4 Data Staging
    4.4.1 Data Extraction
    4.4.2 Data Cleansing
    4.4.3 Replacing Missing Values
    4.4.4 Data Transformation
  4.5 Spatial Datawarehouses (SDW)
  4.6 Distributed Datawarehouses
    Advantages of DDW
    4.6.1 Virtual Data Warehouses (VDW)
    4.6.2 Web-based Data Warehouses (WDW)
  4.7 DW Indexing
  4.8 Security in Datawarehousing
  4.9 What is OLAP?
  4.10 OLAP vs OLTP
    4.10.1 Advantages of OLAP
  4.11 Data Cubes and Cuboids
    4.11.1 Dimensional Modeling
    4.11.2 Concept Hierarchy
      Fact Table
      Additive Facts
      Dimension Table
  4.12 OLAP Schemas
    4.12.1 Star Schema
    4.12.2 Snowflake Schema
    4.12.3 Fact Constellation Schema
  4.13 OLAP Operations
    4.13.1 Roll-up
    4.13.2 Drill-Down
    4.13.3 Slicing
    4.13.4 Dicing
    4.13.5 Pivoting
  4.14 OLAP Security
  4.15 OLAP Software
  4.16 Exercises
  References

5. DECISION TREES
  5.1 Graph Theory
    5.1.1 Drawing Graphs
    5.1.2 Bipartite Graphs
      Constructing Bipartite Graphs
  5.2 Trees
    Drawing Trees
  5.3 Decision Trees
    Chance and Terminal Nodes
    5.3.1 Advantages of Decision Trees
    5.3.2 Disadvantages of Decision Trees
    5.3.3 Classification
    5.3.4 Production Rules
  5.4 Induction Algorithms
    5.4.1 ID3 Algorithm
    5.4.2 Building a DT
    5.4.3 C4.5 Algorithm
  5.5 Measures for Node Splitting
    5.5.1 Gini's Index Measure
    5.5.2 Shannon's Entropy Measure
    5.5.3 Minimum Classification Error Measure
      Gain and Impurity
    5.5.4 CHi-squared Automatic Interaction Detector (CHAID)
    5.5.5 Classification and Regression Tree (CART)
  5.6 Pruning Decision Trees
  5.7 Fuzzy Decision Trees
    Decision Tables
  5.8 Applications
    Fraud Detection
  5.9 Software for Decision Trees
  5.10 Exercises
  References

6. ASSOCIATION RULES
  6.1 Association Rules
    6.1.1 Antecedent and Consequent
  6.2 Association Rule Measures
    6.2.1 Confidence and Support
    6.2.2 Cross-purchase Analysis
    6.2.3 Categorical Variables
    6.2.4 Sequence-purchase Analysis
  6.3 Association Rule Mining
    6.3.1 Activity Indicators
    6.3.2 Computational Complexity of ARM
    6.3.3 Sparse Association Rules
    6.3.4 Rare Associations
  6.4 Temporal Association Rules
    6.4.1 Pareto Analysis
    6.4.2 Paired Comparisons Analysis
    6.4.3 Negative Associations
    6.4.4 Fuzzy Association Rules
    6.4.5 Plan Mining
  6.5 Generalisations of Association Rules
  6.6 Extended Association Rules
    6.6.1 Multi-Level Association Rules (MLAR)
    6.6.2 Multi-Dimensional Association Rules (MDAR)
    6.6.3 Constrained Association Rules
    6.6.4 Rule Constraints in Association Rule Mining
    6.6.5 Weighted Association Rule Mining (WARM)
  6.7 Algorithms for Association Rules
  6.8 Applications
    6.8.1 Purchase Domain Application
    6.8.2 Diagnosis
    6.8.3 Inventory Arrangement
    6.8.4 Fraud Detection
  6.9 Software for Association Rules
  6.10 Exercises
  References

7. CLUSTER ANALYSIS
  7.1 Meaning of Clustering
    7.1.1 Geometric Interpretation
      Cluster Display
      Cluster Formation
    7.1.2 Cluster Analysis Step-by-Step
  7.2 Similarity Metrics
    7.2.1 Euclidean Distance Metric (L2 Metric)
    7.2.2 Manhattan Metric (L1 Metric)
    7.2.3 Minkowski Metric
    7.2.4 Mahalanobis' Distance Metric
    7.2.5 Chebychev Metric (L∞ Metric)
    7.2.6 Other Metrics
  7.3 Clustering Algorithms
    7.3.1 Hierarchical Clustering Algorithms (HCA)
      Agglomerative Algorithm
      Divisive Algorithm
    7.3.2 Partitioning Algorithms
      K-means Clustering Algorithm
    7.3.3 Density-based Methods
  7.4 Cluster Validation Techniques (CVT)
  7.5 Applications
    7.5.1 Marketing
    7.5.2 Insurance
    7.5.3 Medical Sciences
    7.5.4 Web Mining
    7.5.5 Aviation
    7.5.6 Miscellaneous Applications
  7.6 Software for Clustering
  7.7 Exercises
  References

8. GENETIC ALGORITHMS
  8.1 Introduction
    8.1.1 Searching for Optimality
  8.2 Genetic Learning Model
    8.2.1 Advantages of Genetic Algorithms
    8.2.2 Disadvantages
    8.2.3 Steps in GA
    8.2.4 A Notation for GA
  8.3 Genetic Operators
    8.3.1 Selection
      Roulette Wheel Selection
      Advantages of Roulette Wheel Selection
      Disadvantages of Roulette Wheel Selection
      Tournament Selection
    8.3.2 Simple Crossover (SX)
    8.3.3 Uniform Crossover (UX)
      Advantages of Uniform Crossover
    8.3.4 Multi-Crossover (MX)
    8.3.5 Mutation
    8.3.6 Inversion
    8.3.7 Advanced Operators
    8.3.8 Arithmetic Crossover (AX)
  8.4 General Alphabet Set
  8.5 Schema Theorem
    8.5.1 Elitism
    8.5.2 Epistasis
  8.6 Implementation of GA
  8.7 Parallel GA (PGA)
    8.7.1 Multi-Stage GA
    8.7.2 Neuro-Genetic Models (NGM)
  8.8 Genetic Programming
  8.9 Applications
    Insurance Fraud Detection
    Miscellaneous Applications
  8.10 Software for GA
  8.11 Exercises
  References

9. NEURAL NETWORKS
  9.1 Introduction to Neural Networks
    Neural Network Inspiration
    9.1.1 Advantages of Neural Networks
  9.2 Components of Neural Networks
    9.2.1 Layering Concept
      Data Transformation and Communication
      Training Phase
      Training Algorithms
    9.2.2 Activation Functions
      Sigmoid Function
      Running Phase
      Pruning Phase
  9.3 Network Topologies
    FFN vs FBN
    9.3.1 Special Types of ANNs
      Single Layer Perceptron (SLP)
      Multi-Layer Perceptron (MLP)
      Knowledge-based Networks
      Kohonen Networks
      Self Organising Map (SOM)
      Fuzzy-Neural Networks (FNN)
      Stochastic Neural Networks
      Radial Basis Function (RBF) Networks
      Probabilistic Neural Networks (PNN)
      Hopfield Networks (HN)
      Miscellaneous Types
    9.3.2 Neural Networks vs MLR
    9.3.3 Back-propagation Learning
      Backpropagation Algorithm (BPA)
      Ill-Conditioning
      Implementation Issues
  9.4 Applications
    Advertising and Media Planning
    Pattern Recognition
    Classification
    Data Compression
    Speaker Identification
    Web Mining
    Biometrics
    Miscellaneous Applications
  9.5 Software for Neural Networks
  9.6 Exercises
  References

10. WEB MINING
  10.1 Web Sites
    10.1.1 Web Pages
    10.1.2 Search Engines
    10.1.3 Indexers
    10.1.4 Information Extraction
    10.1.5 Linguistic Search Engines
  10.2 Web Mining
    10.2.1 Advantages of Web Mining
    10.2.2 Implementing Web Mining
  10.3 Web Content Mining (WCM)
    10.3.1 Web Usage Mining (WUM)
    10.3.2 Web User Quality Mining
  10.4 Web Structure Mining (WSM)
    10.4.1 Link Mining
    10.4.2 Measures for Web Structure Mining
    10.4.3 Link Categorisation
    10.4.4 Link Stepping
    10.4.5 Links Analysis
    10.4.6 Web Query Mining (WQM)
    10.4.7 Query Performance Measures
      F-score
  10.5 Semantic Web Mining
    10.5.1 Metadata Mining
    10.5.2 Multilingual Web Mining
    10.5.3 Web Personalisers
  10.6 Text Mining
    10.6.1 Text Mining Workflow
    10.6.2 Pre-processing Text
    10.6.3 Text Categorisation
    10.6.4 Mining Textified Documents
    10.6.5 Temporal Text Mining (TTM)
    10.6.6 Distributed Text Mining (DTM)
    10.6.7 Metrics for Text Mining
  10.7 Image Mining
    10.7.1 Issues in Image Mining
    10.7.2 Multimedia Mining
    10.7.3 Table Mining
    10.7.4 Data Stream Mining (DSM)
  10.8 Applications
    Spam-Mail Classification
    Web-page Clustering
    Web Marketing
    Miscellaneous Applications
  10.9 Software for Web and Text Mining
  10.10 Exercises
  References

11. SUPPORT VECTOR MACHINES
  11.1 Introduction
    11.1.1 Structural Risk Minimisation Principle
    11.1.2 Linear Separability
    11.1.3 Solution Techniques
    11.1.4 Hyperplane Classifiers
    11.1.5 SVM Classifier
    11.1.6 Overlapping Classes
    11.1.7 Simple SVM (SSVM)
    11.1.8 Lagrangian Formulation
    11.1.9 Dual SVM Formulation
      Properties of Dual SVM
  11.2 Weighted SVM (W-SVM)
  11.3 Multi-class SVM (MC-SVM)
    11.3.1 Pair-wise SSVM (One-versus-One [OVO])
    11.3.2 One-versus-All (OVA) SVM
  11.4 Soft-Margin SVM (SM-SVM)
    11.4.1 Weighted Soft Margin SVM (WSM-SVM)
    11.4.2 ν-SVM
    11.4.3 Pruning
  11.5 Kernels
    11.5.1 Properties of Kernels
    11.5.2 Mercer's Theorem
  11.6 Nonlinear SVM (NL-SVM)
    11.6.1 Other Kernel Algorithms
  11.7 Support Vector Regression (SVR)
  11.8 Applications of SVM
    11.8.1 Medical Application
    11.8.2 Text Categorisation
  11.9 SVM Software
  11.10 Exercises
  References

12. LATENT SEMANTIC INDEXING
  12.1 Vector Space Models
    12.1.1 Term-by-Document Matrix
    12.1.2 Textual IR
    12.1.3 Geometric Interpretation
  12.2 Latent Semantic Analysis
    12.2.1 Steps in LSA
    12.2.2 Characteristics of LSA
    12.2.3 Advantages of LSA
    12.2.4 Disadvantages of LSA
  12.3 Singular Value Decomposition
    The SVD Algorithm
  12.4 LSI Query
    12.4.1 Query Processing
  12.5 Applications of LSI
    12.5.1 LSI in Information Retrieval
  12.6 Software for LSI
  12.7 Exercises
  References

Appendix-A: The Backpropagation Algorithm
Solution to Selected Exercises
Index