Data Mining Demystified
John Aleshunas
Fall Faculty Institute
October 2006
Prediction is very hard, especially when it's about the future.
- Yogi Berra
Data Mining Stories

 “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection
 The NSA is using data mining to analyze telephone call data to track al-Qaeda activities
 Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores
Preview

 Why data mining?
 Example data sets
 Data mining methods
 Example application of data mining
 Social issues of data mining
Why Data Mining?

 Database systems have been around since the 1970s
 Organizations have a vast digital history of the day-to-day pieces of their processes
 Simple queries no longer provide satisfying results: they take too long to execute, and they cannot help us find new opportunities

Source: Han
Why Data Mining?

 Data doubles about every year, while useful information seems to be decreasing
 Vast data stores overload traditional decision-making processes
 We are data rich, but information poor

Source: Han
Data Mining: a definition

Simply stated, data mining refers to the extraction of knowledge from large amounts of data.
Data Mining Models: A Taxonomy

 Predictive: Classification, Regression, Time Series Analysis, Prediction
 Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery

Source: Dunham
Example Datasets

 Iris
 Wine
 Diabetes
Iris Dataset

 Created by R.A. Fisher (1936)
 150 instances
 Three cultivars (Setosa, Virginica, Versicolor), 50 instances each
 4 measurements (petal width, petal length, sepal width, sepal length)
 One cultivar (Setosa) is easily separable; the others are not – noisy data

Source: Fisher
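The datasets used in this talk are available from the UCI Machine Learning Repository; as one convenient way to get the Iris data today (not part of the original talk), scikit-learn ships a copy of it:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): sepal length, sepal width, petal length, petal width (cm)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica'], 50 instances each
```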
Iris Dataset Analysis

(Figure 2: sepal width (cm) plotted against record number for Iris-Setosa, Iris-Versicolor, and Iris-Virginica.)
Wine Dataset

 This data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different varieties.
 153 instances with 13 constituents found in each of the three types of wines.

Source: UCI Machine Learning Repository
Wine Dataset Analysis

(Figures: Flavanoids and Ash values plotted against instance number for Class 1, Class 2, and Class 3.)
Diabetes Dataset

 Data is based on a population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix in 1990
 768 instances
 9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes)
 Dataset has many missing values; only 532 instances are complete

Source: UCI Machine Learning Repository
Diabetes Dataset Analysis

(Figures: PG Concentration and Diastolic BP values plotted against instance number for the Healthy and Sick groups.)
Classification

 Classification builds a model using a training dataset with known classes of data
 That model is used to classify new, unknown data into those classes
Classification Techniques

 K-Nearest Neighbors
 Decision Tree Classification (ID3, C4.5)
K-Nearest Neighbors Example

(Figure: an unknown point X surrounded by points labeled A and B; X receives the majority label among its nearest neighbors.)

 Simple to implement
 Easy to explain
 Sensitive to the selection of the classification population
 Not always conclusive for complex data
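A minimal sketch of the technique (not from the original talk) using scikit-learn on the Iris data; the 70/30 split and k = 5 are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# classify each test point by a majority vote of its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"test accuracy: {knn.score(X_test, y_test):.3f}")
```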
K-Nearest Neighbors Example

MISCLASSIFICATION PERCENTAGES

Iris Dataset   All Attributes   Petal Length and Petal Width
Setosa         0/150 = 0%       0/150 = 0%
Versicolor     0/150 = 0%       0/150 = 0%
Virginica      9/150 = 6%       7/150 = 4.67%
Total          6%               4.67%

Wine Dataset   All Attributes   Phenols, Flavanoids, OD280/OD315
Class 1        0/153 = 0%       2/153 = 1.31%
Class 2        9/153 = 5.88%    30/153 = 19.61%
Class 3        0/153 = 0%       0/153 = 0%
Total          5.88%            20.92%

Source: Indelicato
Decision Tree Example (C4.5)

 C4.5 is a decision tree generating algorithm based on the ID3 algorithm. It contains several improvements needed for a practical software implementation.
 The choice of the best splitting attribute is based on an entropy calculation.
 These improvements include:
   Choosing an appropriate attribute selection measure
   Handling training data with missing attribute values
   Handling attributes with differing costs
   Handling continuous attributes
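C4.5 itself is rarely packaged for Python, but as a hedged stand-in (not the implementation used in the talk), scikit-learn's DecisionTreeClassifier, a CART implementation, can be told to split on information gain via criterion="entropy", mirroring the entropy calculation described above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# criterion="entropy" chooses each split by information gain, as ID3/C4.5 do
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
print(f"nodes: {tree.tree_.node_count}, "
      f"test accuracy: {tree.score(X_test, y_test):.3f}")
```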
Decision Tree Example (C4.5)

 Iris dataset: accuracy 97.67%
 Wine dataset: accuracy 86.7%

Source: Seidler
Decision Tree Example (C4.5)

Diabetes dataset

 C4.5 produces a complex tree (195 nodes)
 The simplified (pruned) tree reduces the classification accuracy

                 Size   Errors        Accuracy
Before Pruning   195    40 (5.2%)     94.8%
After Pruning    69     102 (13.3%)   86.7%
Association Rules

Association rules are used to show the relationships between data items. Purchasing one product when another product is purchased is an example of an association rule. They do not represent any causality or correlation.
Association Rule Techniques

 Market Basket Analysis
 Terminology:
   Transaction database
   Association rule – an implication {A, B} => {C}
   Support – the percentage of transactions in which {A, B, C} occurs
   Confidence – the ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}
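As a quick illustration of these two definitions (the tiny transaction database below is invented for the example):

```python
# each transaction is the set of items bought together
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(antecedent | consequent) / support(antecedent)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

lhs, rhs = {"bread", "milk"}, {"diapers"}
print(f"support    = {support(lhs | rhs, transactions):.2f}")   # 0.40
print(f"confidence = {confidence(lhs, rhs, transactions):.2f}") # 0.67
```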
Association Rule Example

1984 United States Congressional Voting Records Database

Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. El-Salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-Nicaraguan-contras: 2 (y,n)
10. MX-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)

Rules:
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} => {Republican}, confidence 91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} => {Democrat}, confidence 97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} => {Republican}, confidence 93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} => {Democrat}, confidence 100.0%

Source: UCI Machine Learning Repository
Clustering

Clustering is similar to classification in that data are grouped. Unlike classification, the groups are not predefined; they are discovered. Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.
Clustering Techniques

 K-Means Clustering
 Neural Network Clustering (SOM)
K-Means Example

 The K-Means algorithm is a method to cluster objects into k partitions based on their attributes.
 It assumes that the k clusters exhibit normal distributions.
 Its objective is to minimize the variance within the clusters.
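A minimal sketch (not from the talk) of running k-means on the Iris data with scikit-learn and tallying how the discovered clusters line up with the known cultivars:

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
# k=3 matches the three known cultivars; n_init restarts guard against bad seeds
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

for k in range(3):
    members = iris.target[km.labels_ == k]
    print(f"cluster {k}:", Counter(iris.target_names[t] for t in members))
```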
K-Means Example

(Figure: a sample dataset partitioned into Cluster 1, Cluster 2, and Cluster 3, each shown with its mean.)
K-Means Example

Iris dataset, petal length attribute only, accuracy 95.33%

Cluster 1: 46 Versicolor, 3 Virginica (cluster mean 4.22857)
Cluster 2: 4 Versicolor, 47 Virginica (cluster mean 5.55686)
Cluster 3: 50 Setosa (cluster mean 1.46275)

Iris dataset, all attributes, accuracy 66.0%

Cluster 1: 47 Versicolor, 49 Virginica (mean 6.30, 2.89, 4.96, 1.70)
Cluster 2: 21 Setosa, 1 Virginica (mean 4.59, 3.07, 1.44, 0.29)
Cluster 3: 29 Setosa, 3 Versicolor (mean 5.21, 3.53, 1.67, 0.35)

Iris dataset, all attributes, 7 clusters, accuracy 90.67%

Cluster 1: 23 Virginica
Cluster 2: 1 Virginica
Cluster 3: 26 Setosa
Cluster 4: 12 Virginica
Cluster 5: 24 Versicolor, 1 Virginica
Cluster 6: 26 Versicolor, 13 Virginica
Cluster 7: 24 Setosa
Self-Organizing Map Example

 The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map.
 SOM is especially good for visualizing high-dimensional data.
 SOM maps input vectors onto a two-dimensional grid of nodes.
 Nodes that are close together have similar attribute values, and nodes that are far apart have different attribute values.
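Since the SOM update rule is compact, here is a minimal from-scratch sketch in NumPy (not the implementation used in the talk); the grid size, learning rate, and neighborhood schedule are illustrative choices, and the inputs are assumed to be scaled to comparable ranges:

```python
import numpy as np
from sklearn.datasets import load_iris

def train_som(data, rows=10, cols=10, iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a rows x cols Kohonen map on data of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # grid coordinates of every node, used by the neighborhood function
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(iters):
        frac = t / iters
        lr = lr0 * (1.0 - frac)              # learning rate decays toward 0
        sigma = sigma0 * (1.0 - frac) + 0.5  # neighborhood radius shrinks
        x = data[rng.integers(len(data))]    # one randomly chosen input vector
        # best-matching unit: the node whose weight vector is closest to x
        dist = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dist), dist.shape)
        # Gaussian neighborhood: nodes near the BMU on the grid move the most
        g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1)
                   / (2.0 * sigma ** 2))
        weights += lr * g[..., None] * (x - weights)
    return weights

# example: map the Iris measurements (scaled to [0, 1]) onto a 10 x 10 grid
data = load_iris().data
data = (data - data.min(0)) / (data.max(0) - data.min(0))
weights = train_som(data)
```

After training, each input lands on its best-matching node; labeling every node with the majority class of the inputs it attracts produces maps like the Iris, Wine, and Diabetes grids shown below.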
Self-Organizing Map Example

(Figure: input vectors in X-Y-Z space projected onto a two-dimensional grid of nodes.)
Self-Organizing Map Example

Iris Data

(Figure: the trained SOM grid with each node labeled Setosa, Versicolor, or Virginica; the Setosa nodes form a contiguous region, while the Versicolor and Virginica regions border each other with some intermixing.)
Self-Organizing Map Example

Wine Data

(Figure: the trained SOM grid with each node labeled Class-1, Class-2, or Class-3; the three classes occupy broadly separate regions, with noticeable intermixing between Class-2 and Class-3 nodes.)
Self-Organizing Map Example

Diabetes Data

(Figure: the trained SOM grid with each node labeled Healthy or Sick; Healthy nodes dominate the map, with Sick nodes concentrated toward one region but partly scattered among the Healthy nodes.)
NFL Quarterback Analysis

 Data from 2005 for 42 NFL quarterbacks
 Preprocessed data to normalize for a full 16-game regular season
 Used SOM to cluster individuals based on performance and descriptive data

Source: McKee
NFL Quarterback Analysis

(Figures: the SOM map of the quarterbacks, the QB passing rating overlay, and the overall clustering.)

Source: McKee
Data Mining Stories - Revisited

 Credit card fraud detection
 NSA telephone network analysis
 Supply chain management
Social Issues of Data Mining

 Impacts on personal privacy and confidentiality
 Classification and clustering are similar to profiling
 Association rules resemble logical implications
 Data mining is an imperfect process, subject to interpretation
Conclusion

 Why data mining?
 Example data sets
 Data mining methods
 Example application of data mining
 Social issues of data mining
What on earth would a man do with himself if something did not stand in his way?
- H.G. Wells

I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble.
- Agatha Christie, from “An Autobiography, Pt III, Growing Up”
References

 Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003
 Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp. 179-188
 Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006
 Indelicato, Nicolas, Analysis of the K-Nearest Neighbors Algorithm, MATH 4500: Foundations of Data Mining, 2004
 McKee, Kevin, The Self Organized Map Applied to 2005 NFL Quarterbacks, MATH 4200: Data Mining Foundations, 2006
 Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science
 Seidler, Toby, The C4.5 Project: An Overview of the Algorithm with Results of Experimentation, MATH 4500: Foundations of Data Mining, 2004