Data Mining – Day 2
Fabiano Dalpiaz
Department of Information and Communication Technology
University of Trento - Italy
http://www.dit.unitn.it/~dalpiaz
Database and Business Intelligence
Academic Year 2007-2008
Knowledge Discovery (KDD) Process
[Diagram: the KDD pipeline: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation. The earlier steps were presented yesterday; Data Mining and Pattern Evaluation are today's topics.]
Outline
Data Mining techniques
  Frequent patterns, association rules
  • Support and confidence
  Classification and prediction
  • Decision trees
  • Bayesian classifiers
  • Support Vector Machines
  • Lazy learning
  Cluster Analysis
Visualization of the results
Summary
Data Mining techniques
Frequent pattern analysis
What is it?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Frequent pattern analysis: searching for frequent patterns
Motivation: finding inherent regularities in data
• Which products are bought together? Yesterday's wine and spaghetti example
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?
Applications
• Basket data analysis
• Cross-marketing
• Catalog design
• Sales campaign analysis
Basic Concepts: Frequent Patterns and Association Rules (1)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Itemsets (= transactions in this example)

Goal: find all rules of type X → Y between items in an itemset, with minimum:
Support s: the probability that an itemset contains both X and Y
Confidence c: the conditional probability that an itemset containing X also contains Y
Basic Concepts: Frequent Patterns and Association Rules (2)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: support s = 50%, confidence c = 50%

Support is used to define frequent patterns (sets of products appearing in at least s% of the itemsets):
{Wine} in itemsets 1, 2, 3 (support = 60%)
{Bread} in itemsets 1, 4, 5 (support = 60%)
{Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
{Cheese} in itemsets 3, 4, 5 (support = 60%)
{Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
Basic Concepts: Frequent Patterns and Association Rules (3)

Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: support s = 50%, confidence c = 50%

Confidence defines association rules: X → Y rules over frequent patterns whose confidence is greater than c

Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why?

Association rules:
Wine → Spaghetti (support = 60%, confidence = 100%)
Spaghetti → Wine (support = 60%, confidence = 75%)
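To make the two measures concrete, here is a minimal Python sketch (not part of the original slides) that recomputes the support and confidence values of the wine and spaghetti example above:

```python
# The five transactions from the slide.
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Wine", "Spaghetti"}))        # 0.6
print(confidence({"Wine"}, {"Spaghetti"}))   # 1.0
print(confidence({"Spaghetti"}, {"Wine"}))   # 0.75
```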
Advanced concepts in Association Rules discovery
Algorithms must address scalability problems
Apriori principle: if an itemset is infrequent, its supersets should not be generated or tested!
Advanced problems
Boolean vs. quantitative associations
age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "car") [s = 1%, c = 75%]
Single-level vs. multiple-level analysis
What brands of wine are associated with what brands of spaghetti?
Are support and confidence clear?
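The Apriori pruning principle stated above can be illustrated with a toy level-wise implementation in Python; this is an illustrative sketch, not the slides' algorithm, and it omits the full subset-pruning step of the real Apriori:

```python
def apriori(transactions, min_support):
    """Toy level-wise Apriori: returns all frequent itemsets as frozensets."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Join step: size-k candidates are built only from frequent (k-1)-itemsets,
        # so supersets of infrequent itemsets are never generated or tested.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        frequent |= current
        k += 1
    return frequent

transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]
print(apriori(transactions, 0.5))  # the five frequent itemsets from the earlier slide
```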
Another example for association rules

Transaction-id | Items bought
1 | Margherita, Beer, Coke
2 | Margherita, Beer
3 | Quattro stagioni, Coke
4 | Margherita, Coke

Suppose: support s = 40%, confidence c = 70%

Frequent itemsets:
{Margherita} = 75%
{Beer} = 50%
{Coke} = 75%
{Margherita, Beer} = 50%
{Margherita, Coke} = 50%

Association rules:
Beer → Margherita (support = 50%, confidence = 100%)
Classification vs. Prediction
Classification
  Characterizes (describes) a set of items belonging to a training set; these items are already classified according to a label attribute
  The characterization is a model
  The model can be applied to classify new data (predict the class they should belong to)
Prediction
  Models continuous-valued functions, i.e., predicts unknown or missing values
Applications
  Credit approval, target marketing, fraud detection
Classification: the process
1. Model construction
  The class label attribute defines the class each item should belong to
  The set of items used for model construction is called the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage
  Estimate the accuracy of the model
  • Accuracy measured on the training set alone tends to be over-optimistic
  • A better estimate is obtained on a test set that is independent of the training set
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
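As an illustration of the two phases (model construction on a training set, accuracy estimation on held-out data), here is a minimal sketch using scikit-learn; the synthetic dataset and the classifier parameters are placeholder choices, not taken from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hold out part of the data so accuracy is measured on unseen tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)       # model construction
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # model usage / evaluation
```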
Classification: the process
Model construction

Training Data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

A classification algorithm builds the classifier (model) from the training data, for example:
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Classification: the process
Model usage

Classifier: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Testing Data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
Supervised vs. Unsupervised
Learning
Supervised learning (classification)
Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Evaluating generated models
Accuracy
  classifier accuracy: predicting class labels
  predictor accuracy: guessing values of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness
  handling noise and missing values
Scalability
  efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Classification techniques
Decision Trees (1)
[Decision tree diagram for an investment type choice: internal nodes test Income > 20K€, Age > 60, and Married?; the leaves assign Low risk, Mid risk, or High risk.]
Classification techniques
Decision Trees (2)
How are the attributes in decision trees selected?
Two well-known measures are used
• Information gain selects the attribute that is most informative for distinguishing the items among the classes
• It is biased towards attributes with a large set of values
• Gain ratio addresses this limitation of information gain
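To make attribute selection concrete, the following minimal Python sketch (not from the slides; the tiny dataset is invented) computes information gain as the reduction in class entropy obtained by splitting on an attribute:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = entropy(labels)
    n = len(rows)
    remainder = 0.0
    for value in set(row[attribute_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute_index] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Invented toy data: (income_high, married) -> risk class
rows = [("yes", "no"), ("yes", "yes"), ("no", "no"), ("no", "yes")]
labels = ["high", "mid", "low", "low"]
print(information_gain(rows, labels, 0))  # gain of splitting on income: 1.0
print(information_gain(rows, labels, 1))  # gain of splitting on marital status: 0.5
```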
Classification techniques
Bayesian classifiers (2)
Bayesian classification
  A statistical classification technique
  • Predicts class membership probabilities
  Founded on Bayes' theorem:
  P(H | X) = P(X | H) · P(H) / P(X)
  • What if X = "Red and rounded" and H = "Apple"?
Performance
  • The simplest implementation (Naïve Bayes) is comparable to decision trees and neural networks
Incremental
  • Each training example can increase or decrease the probability that a hypothesis is correct
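As a worked illustration of the theorem, the sketch below plugs invented numbers into the apple example; every probability here is a placeholder, not data from the slides:

```python
# Hypothesis H = "the fruit is an apple", evidence X = "red and rounded".
# All probabilities below are invented for illustration.
p_h = 0.30          # prior P(H): fraction of fruits that are apples
p_x_given_h = 0.80  # likelihood P(X|H): apples that are red and rounded
p_x = 0.40          # evidence P(X): fruits that are red and rounded overall

# Bayes' theorem: posterior P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6: observing "red and rounded" raises the apple probability from 0.3 to 0.6
```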
5 minutes break!
Classification techniques
Support Vector Machines
One of the most advanced classification techniques
[Two figures: on the left, a separating boundary with a small margin between the classes; on the right, the boundary with the largest margin.]
Support vector machines (SVMs) are able to identify the largest margin, as in the right figure
Classification techniques
SVMs + Kernel Functions
Is data always linearly separable? NO!
Solution: SVMs + kernel functions
[Figure: a data set that cannot be split by a straight line; a plain SVM fails, while an SVM with a kernel function separates the two classes.]
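A minimal scikit-learn sketch (not from the slides) of this idea: on a data set that no straight line can separate, a linear SVM performs near chance while the same SVM with an RBF kernel separates the classes; the dataset and parameters are illustrative choices:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: impossible to separate with a straight line.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf").fit(X, y)

print("linear SVM accuracy:", linear_svm.score(X, y))      # roughly chance level
print("RBF-kernel SVM accuracy:", kernel_svm.score(X, y))  # close to 1.0
```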
Classification techniques
Lazy learning
Lazy learning
Simply stores the training data (with little or no processing) and waits until it is given a test tuple
Less time in training but more time in predicting
Uses a richer hypothesis space (many local linear functions), which can yield higher accuracy
Instance-based learning
Subcategory of lazy learning
Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
An example: k-nearest neighbor approach
Classification techniques
k-nearest neighbor
All instances correspond to points in the n-dimensional space; x is the instance to be classified
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
For discrete-valued class attributes, k-NN returns the most common value among the k training examples nearest to x
[Figure: a green circle (the query point) surrounded by red and blue points.]
Which class should the green circle belong to? It depends on k: with k = 3 it is red, with k = 5 it is blue.
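The following minimal Python sketch (with invented coordinates, not the slides' figure) shows a plain k-NN classifier and how the predicted class of the same query point can flip between k = 3 and k = 5:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# Invented 2-D training points: (coordinates, class label)
training = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
    ((2.5, 2.5), "blue"), ((2.6, 2.4), "blue"), ((2.4, 2.6), "blue"),
]

def knn_predict(query, k):
    """Return the majority class among the k training points closest to `query`."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

query = (1.6, 1.6)  # the "green circle"
print(knn_predict(query, k=3))  # "red": the 3 closest points are mostly red
print(knn_predict(query, k=5))  # "blue": the two extra neighbours are blue
```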
Prediction techniques
An overview
Prediction is different from classification
  Classification predicts categorical class labels
  Prediction models continuous-valued functions
Major method for prediction: regression
  Models the relationship between one or more independent (predictor) variables and a dependent (response) variable
Regression analysis
  Linear and multiple regression
  Non-linear regression
  Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees
No details here
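Although the slides give no details on regression, a least-squares fit on invented data can serve as a minimal illustration of predicting a continuous value:

```python
# Simple (one-variable) least-squares linear regression, implemented directly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # predictor variable (invented)
ys = [1.9, 4.1, 6.2, 7.8, 10.1]  # response variable (invented)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print("prediction for x = 6:", slope * 6 + intercept)  # a continuous-valued prediction
```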
What is cluster analysis?
Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Cluster analysis
  Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
  It belongs to unsupervised learning
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms (day 1 slides)
Examples of cluster analysis
Marketing:
  Help marketers discover distinct groups in their customer bases
Land use:
  Identification of areas of similar land use in an earth observation database
Insurance:
  Identifying groups of motor insurance policy holders with a high average claim cost
City-planning:
  Identifying groups of houses according to their house type, value, and geographical location
Good clustering
A good clustering method will produce high-quality clusters with
  high intra-class similarity
  low inter-class similarity
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
It is hard to define "similar enough" or "good enough"
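To connect the distance function d(i, j) with the intra-/inter-cluster similarity goal, here is a small k-means sketch in Python; the 2-D points are invented and k-means is just one clustering method among many, not one the slides prescribe:

```python
from math import dist
from statistics import mean

# Invented 2-D points forming two visible groups.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]

def kmeans(points, k, iterations=10):
    """Tiny k-means: assign each point to its nearest centroid, then update centroids."""
    centroids = points[:k]  # naive initialisation: the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((mean(x for x, _ in cluster),
                                      mean(y for _, y in cluster)))
            else:
                new_centroids.append(old)  # keep the old centroid if its cluster is empty
        centroids = new_centroids
    return clusters

for cluster in kmeans(points, k=2):
    print(cluster)  # points near (1, 1) end up in one cluster, points near (5, 5) in the other
```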
A small example
How to cluster this data?
This process is not easy in practice. Why?
Visualization of the results
Presentation of the results or knowledge obtained from data mining in visual forms
Examples
Scatter plots
Association rules
Decision trees
Clusters
Scatter plots (SAS Enterprise Miner)
[Screenshot]
Association rules (SGI/Mineset)
[Screenshot]
Decision trees (SGI/Mineset)
[Screenshot]
Clusters (IBM Intelligent Miner)
[Screenshot]
Summary
Why Data Mining?
Data Mining and KDD
Data preprocessing
Some scenarios
Classification
Clustering