Data Mining and Machine Learning
Sunita Sarawagi ([email protected])

Data Mining
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Prediction based on past history:
  - Predict whether a credit card applicant poses a good credit risk, based on attributes (income, job type, age, ...) and past history.
  - Predict whether a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification: given a new item whose class is unknown, predict to which class it belongs.
  - Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value.

Data Mining (Cont.)
- Descriptive patterns:
  - Associations: find books that are often bought by "similar" customers; if a new such customer buys one such book, suggest the others too. Associations may also be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer.
  - Clusters: e.g. typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.

Data Mining: Data and Patterns
- Data: of various shapes and sizes.
- Patterns/models: of various shapes and sizes; an abstraction of the data into some understandable and useful form.
- Basic structure of data: a set of instances/objects/cases/rows/points/examples, where each instance has a fixed set of attributes/dimensions/columns, either continuous or categorical.
- Patterns:
  - Express one attribute as a function of another: classification, regression.
  - Group together related instances: clustering, projection, factorization, itemset mining.

Classification
- Given old data about customers and payments, predict a new applicant's loan eligibility.
- Training: labeled data on previous customers (age, salary, profession, location, and a class label such as customer type) is used to build a model/classifier, e.g. decision rules such as "salary > 5 L and profession = Exec => good/bad".
- Deployment: the model labels a new customer's (unlabeled) data.

Applications
- Ad placement in search engines
- Book recommendation
- Citation databases: Google Scholar, CiteSeer
- Resume organization and job matching
- Retail data mining
- Banking: loan/credit card approval
- Customer relationship management: identify customers who are likely to leave for a competitor
- Targeted marketing: predict good customers based on old customers; identify likely responders to promotions
- Machine translation
- Speech and handwriting recognition
- Fraud detection: in telecommunications and financial transactions, identify fraudulent events from an online stream of events

Applications (continued)
- Medicine: disease outcome, effectiveness of treatments
- Molecular/pharmaceutical: identify new drugs
- Scientific data analysis: analyze patient disease history to find relationships between diseases; identify new galaxies by searching for sub-clusters
- Image and vision: object recognition from images, removing noise from images, identifying scene breaks

The KDD process
- Problem formulation
- Data collection
- Pre-processing (cleaning): name/address cleaning, reconciling different meanings (annual, yearly), duplicate removal, supplying missing values
- Transformation:
  - Subset the data by sampling (this might hurt if the data is highly skewed)
  - Feature selection: principal component analysis, heuristic search
  - Map complex objects, e.g. time series data, to features such as frequency
- Choosing the mining task and mining method
- Result evaluation and visualization
- Knowledge discovery is an iterative process.
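As a small illustration of the transformation step above, here is a minimal sketch (using numpy; the data shape and the number of kept components are hypothetical, not from the slides) of reducing features with principal component analysis:

```python
import numpy as np

# Hypothetical data: 100 instances with 10 continuous attributes.
X = np.random.rand(100, 10)

# PCA via SVD: center the data, then project onto the directions
# of largest variance.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 3                                # keep the top 3 components
X_reduced = X_centered @ Vt[:k].T    # shape (100, 3)

# Fraction of total variance captured by the kept components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(float(explained), 3))
```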
Mining products
- A typical product architecture: a data warehouse from which data is extracted (e.g. via ODBC), preprocessing utilities (sampling, attribute transformation), scalable mining algorithms (association, classification, clustering, sequence mining), mining operations, and visualization tools.
- Commercial tools: SAS Enterprise Miner, SPSS, IBM Intelligent Miner, Microsoft SQL Server data mining services, Oracle Data Mining (ODM).
- Free tools: Weka, individual algorithms.

Mining operations
- Classification: regression, classification trees, neural networks, Bayesian learning, nearest neighbour, radial basis functions, support vector machines, meta-learning methods (bagging, boosting).
- Clustering: hierarchical, EM, density based.
- Sequence mining: time series similarity, temporal patterns, sequential classification, graphical models, hidden Markov models.
- Itemset mining: association rules, causality.

Classification methods
- Goal: predict the class Ci = f(x1, x2, ..., xn).
- Regression (linear or any other polynomial).
- Decision tree classifier: divides the decision space into piecewise constant regions.
- Neural networks: partition by non-linear boundaries.
- Probabilistic/generative models.
- Lazy learning methods: nearest neighbour.
- Support vector machines: find a boundary that maximally separates the classes.

Decision tree classifiers
- A widely used learning method.
- Easy to interpret: can be re-represented as if-then-else rules.
- Approximates the function by piecewise constant regions.
- Does not require any prior knowledge of the data distribution; works well on noisy data.
- Has been applied to: classifying medical patients by disease, equipment malfunctions by cause, loan applicants by likelihood of payment, and lots of other applications.

Decision trees
- A tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
- [Figure: a small example tree with tests such as Salary < 1 M, Prof = teaching and Age < 30 at internal nodes and Good/Bad labels at the leaves.]

Training Dataset
This follows an example from Quinlan's ID3.

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31...40 high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31...40 low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31...40 medium   no        excellent       yes
31...40 high     yes       fair            yes
>40     medium   no        excellent       no

Output: A Decision Tree for "buys_computer"
- Root: age?
  - <=30: test student? (no -> no, yes -> yes)
  - 31...40: yes
  - >40: test credit_rating? (excellent -> no, fair -> yes)
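A minimal sketch of learning such a tree from the buys_computer table (this uses pandas and scikit-learn for illustration; they are not part of the original slides):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The 14-row buys_computer training set from the table above.
rows = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student",
                                 "credit_rating", "buys_computer"])

# One-hot encode the categorical attributes and fit an entropy-based tree.
X = pd.get_dummies(df.drop(columns="buys_computer"))
y = df["buys_computer"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Print the learned tree as nested if-then rules.
print(export_text(tree, feature_names=list(X.columns)))
```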
Weather Data: Play or not Play?

Outlook    Temperature   Humidity   Windy   Play?
sunny      hot           high       false   No
sunny      hot           high       true    No
overcast   hot           high       false   Yes
rain       mild          high       false   Yes
rain       cool          normal     false   Yes
rain       cool          normal     true    No
overcast   cool          normal     true    Yes
sunny      mild          high       false   No
sunny      cool          normal     false   Yes
rain       mild          normal     false   Yes
sunny      mild          normal     true    Yes
overcast   mild          high       true    Yes
overcast   hot           normal     false   Yes
rain       mild          high       true    No

Note: Outlook here means the weather forecast; no relation to the Microsoft email program.

Example Tree for "Play?"
- Root: Outlook
  - sunny: test Humidity (high -> No, normal -> Yes)
  - overcast: Yes
  - rain: test Windy (true -> No, false -> Yes)

Topics to be covered
- Tree construction: the basic tree learning algorithm; measures of predictive ability; high-performance decision tree construction (SPRINT).
- Tree pruning: why prune; methods of pruning.
- Other issues: handling missing data; continuous class labels; effect of training size.

Tree learning algorithms
- ID3 (Quinlan 1986) and its successor C4.5 (Quinlan 1993)
- CART
- SLIQ (Mehta et al.)
- SPRINT (Shafer et al.)

Basic algorithm for tree building
- Greedy top-down construction:
  Gen_Tree(node, data)
    if node should be made a leaf: stop
    find the best attribute and the best split on that attribute (selection criteria)
    partition the data on the split condition
    for each child j of node: Gen_Tree(node_j, data_j)

Split criteria
- Select the attribute that is best for classification; intuitively, pick the one that best separates instances of different classes.
- Quantifying the intuition means measuring separability. First define the impurity of an arbitrary set S consisting of K classes: it should be smallest when S consists of only one class and highest when all classes are present in equal numbers, and it should allow computation in multiple stages.

Measures of impurity
- Entropy: Entropy(S) = - sum_{i=1..k} p_i log p_i
- Gini: Gini(S) = 1 - sum_{i=1..k} p_i^2
- [Figure: both impurity measures plotted against p1 for a two-class problem; entropy peaks at 1 and Gini at 0.5 when p1 = 0.5.]

Information gain
- The information gain of partitioning S into r subsets is the impurity of S minus the weighted impurity of the subsets:
  Gain(S, S1..Sr) = Entropy(S) - sum_{j=1..r} (|Sj| / |S|) Entropy(Sj)

Information gain: example
- K = 2, |S| = 100, p1 = 0.6, p2 = 0.4; E(S) = -0.6 log 0.6 - 0.4 log 0.4 = 0.29
- |S1| = 70, p1 = 0.8, p2 = 0.2; E(S1) = -0.8 log 0.8 - 0.2 log 0.2 = 0.21
- |S2| = 30, p1 = 0.13, p2 = 0.87; E(S2) = -0.13 log 0.13 - 0.87 log 0.87 = 0.16
- Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1

Which attribute to select?
- Consider the weather data above (the table is repeated on the original slide): which attribute gives the highest information gain?
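Before the worked numbers below, here is a minimal sketch (numpy only; the helper name is ours, not from the slides) of computing the information gain of Outlook from class counts:

```python
import numpy as np

def entropy(counts):
    """Entropy in bits of a class-count vector, treating 0*log(0) as 0."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# Weather data: 9 Yes / 5 No overall; Outlook splits it into
# sunny [2 Yes, 3 No], overcast [4 Yes, 0 No], rainy [3 Yes, 2 No].
total = [9, 5]
splits = [[2, 3], [4, 0], [3, 2]]

n = sum(total)
expected = sum(sum(s) / n * entropy(s) for s in splits)
gain = entropy(total) - expected
print(round(entropy(total), 3), round(expected, 3), round(gain, 3))
# -> approximately 0.94, 0.694 and 0.247, matching the calculation below
```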
Example: attribute "Outlook" (this part follows Witten & Eibe)
- "Outlook" = "sunny": info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits
- "Outlook" = "overcast": info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits
  (Note: log(0) is not defined, but we evaluate 0*log(0) as zero.)
- "Outlook" = "rainy": info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits
- Expected information for the attribute:
  info([2,3],[4,0],[3,2]) = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits

Computing the information gain
- Information gain = (information before split) - (information after split):
  gain("Outlook") = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes of the weather data:
  gain("Outlook") = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity") = 0.152 bits
  gain("Windy") = 0.048 bits

Continuing to split (at the Outlook = sunny node)
- gain("Humidity") = 0.971 bits
- gain("Temperature") = 0.571 bits
- gain("Windy") = 0.020 bits

The final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes.
- Splitting stops when the data can't be split any further.

Preventing overfitting
- A tree T overfits if there is another tree T' that gives higher error on the training data yet lower error on unseen data.
- An overfitted tree does not generalize to unseen instances.
- Overfitting happens when the data contains noise or irrelevant attributes and the training size is small.
- Overfitting can reduce accuracy drastically: by 10-25%, as reported in Mingers' 1989 Machine Learning paper.
- Example of overfitting with binary data:

Training Data vs. Test Data Error Rates
- Compare error rates measured on the learning data and on a large test set.
- The learning-data error R(T) always decreases as the tree grows (Q: why?).
- The test error Rts(T) first declines, then increases (Q: why?).
- The overfitted tree is the result of too much reliance on the learning-data error R(T), and can lead to disasters when applied to new data.

No. terminal nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
10 **                .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91
(Digit recognition dataset, from the CART book; ** marks the tree size with the lowest test error.)

Overfitting example
- Consider the case where a single attribute xj is adequate for classification, but with an error of 20%.
- Now consider lots of other noise attributes that enable zero error during training.
- During testing, this detailed tree will have an expected error of (0.8*0.2 + 0.2*0.8) = 32%, whereas the pruned tree with only a single split on xj will have an error of only 20%.

Approaches to prevent overfitting
- Two approaches:
  1. Stop growing the tree beyond a certain point. Tricky, since even when the information gain is zero an attribute might still be useful (XOR example).
  2. First overfit, then post-prune (more widely used). Tree building is divided into a growth phase and a prune phase.

Deciding the final tree size: three criteria
- Cross validation with separate test data.
- Statistical bounds: use all data for training, but apply a statistical test to decide the right size (a cross-validation dataset may be used to set the threshold).
- Use some criterion function to choose the best size, e.g. the minimum description length (MDL) criterion.

Cross validation
- Partition the dataset into two disjoint parts:
  1. A training set used for building the tree.
  2. A validation set used for pruning the tree.
- Rule of thumb: 2/3rds training, 1/3rd validation.
- Evaluate the tree on the validation set and, at each leaf and internal node, keep a count of correctly labeled data.
- Starting bottom-up, prune a node if its validation error is no higher than that of its children (i.e. pruning does not hurt validation accuracy).
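A minimal sketch (scikit-learn; the dataset and the candidate depths are illustrative, not from the slides) of using a held-out validation split to compare a fully grown tree with a smaller, effectively pruned one:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Rule of thumb from the slide: 2/3 for growing the tree, 1/3 for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Stand-in for pruning: limit tree size and keep the depth that
# scores best on the validation set.
best = max(
    (DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
     for d in range(1, 11)),
    key=lambda t: t.score(X_valid, y_valid))

print("full tree  train/valid:", full.score(X_train, y_train), full.score(X_valid, y_valid))
print("small tree train/valid:", best.score(X_train, y_train), best.score(X_valid, y_valid))
```

The fully grown tree typically reaches near-zero training error but a worse validation score, which is the same pattern as in the R(T)/Rts(T) table above.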
- What if the training data set size is limited?
  - n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn. Train n classifiers, each using D - Di as training data and Di as test data, and average the results. (How do we then pick a single tree?)

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules: one rule is created for each path from the root to a leaf, each attribute-value pair along the path forms a conjunction, and the leaf node holds the class prediction.
- Rules are easier for humans to understand.
- Example:
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31...40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Rule-based pruning
- Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
- Rule-based pruning: for each leaf of the tree, extract a rule as the conjunction of all tests up to the root. On the validation set, independently prune tests from each rule to get the highest accuracy for that rule. Then sort the rules by decreasing accuracy.

Regression trees
- Decision trees with continuous class labels: regression trees approximate the function with piecewise constant regions.
- Split criteria for regression trees:
  - Predicted value for a set S = the average of all values in S.
  - Error: the sum of squared errors of each member of S from the predicted average.
  - Pick the split with the smallest average error.
- Splits on categorical attributes: can they be done better than for discrete class labels? (Homework.)

Other types of trees
- Multi-way trees on low-cardinality categorical data.
- Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes].
- Multi-attribute tests at nodes to handle correlated attributes: multivariate linear splits [Oblique trees, Murthy 94].

Issues
- Methods of handling missing values: assume the majority value, or take the most probable path.
- Allowing varying costs for different attributes.

Pros and cons of decision trees
- Pros:
  + Reasonable training time
  + Fast application
  + Easy to interpret
  + Easy to implement
  + Intuitive
- Cons:
  - Not effective for very high dimensional data where information about the class is spread in small amounts over many correlated features (example: words in text classification)
  - Not robust to dropping of important features, even when correlated substitutes exist in the data

The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbours are defined in terms of Euclidean distance.
- The target function can be discrete- or real-valued. For discrete values, k-NN returns the most common value among the k training examples nearest to xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Figure omitted; from Jiawei Han's slides.]

Other lazy learning methods
- Locally weighted regression: learn a new regression equation by weighting each training instance based on its distance from the new instance.
- Radial basis functions.
- Pros: fast training.
- Cons: slow during application; no feature selection; the notion of proximity is vague.
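A minimal sketch (numpy; the function name and the toy data are ours) of the k-NN prediction rule described above:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the majority class among the k Euclidean-nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2-D data: class 0 near the origin, class 1 near (5, 5).
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(X, y, np.array([5.5, 5.0])))  # -> 1
```

Note that all the work happens at query time, which is why these methods are called lazy: training is fast, application is slow.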
Bayesian learning
- Assume a probability model for the generation of the data.
- Apply Bayes' theorem to find the most likely class:
  c = argmax_{cj} p(cj | d) = argmax_{cj} p(d | cj) p(cj) / p(d)
- Naive Bayes: assume the attributes are conditionally independent given the class value:
  c = argmax_{cj} p(cj) * prod_{i=1..n} p(ai | cj) / p(d)
- The probabilities are easy to learn by counting, in a single pass over the data.
- Useful in some domains, e.g. text. Numeric attributes must be discretized.

Bayesian belief networks
- Find the joint probability over a set of variables, making use of conditional independence wherever it is known.
- [Figure: a small belief network with a conditional probability table for b given a and d; variable e is independent of d given b.]
- Learning the parameters is hard when there are hidden units: use gradient descent / EM algorithms.
- Learning the structure of the network is harder still.

Neural networks
- Useful for learning complex data like handwriting, speech and image recognition.
- [Figure: decision boundaries produced by linear regression, a classification tree and a neural network.]

Pros and cons of neural networks
- Pros:
  + Can learn more complicated class boundaries
  + Fast application
  + Can handle a large number of features
- Cons:
  - Slow training time
  - Hard to interpret
  - Hard to implement: trial and error for choosing the number of nodes
- Conclusion: use neural nets only if decision trees or nearest-neighbour methods fail.

Linear discriminants: problem setting
- Given a binary classification problem with points in d dimensions.
- Training data: n vectors with predictions, of the form (x1, y1), ..., (xn, yn); each y takes the value 1 or -1.
- Goal: learn a function of the form f(x) = w.x + b = w1 x1 + w2 x2 + ... + wd xd + b.

Linear regression
- Developed for the case of real-valued y: y = f(x) = w.x + w0; rewrite as y = w.x with each x padded with a 1.
- Error: the sum of squared errors, E(w) = sum_i (yi - w.xi)^2. Minimize the error by differentiating with respect to w; the minimum is reached at w = (X'X)^-1 X'Y.

Fisher's linear discriminant
- Find the direction (w, b) on which the projection of the data is maximally separated.
- Cost function: the separation of the projected class means relative to the (class-weighted) spread of the projections, where mi and si are the mean and standard deviation of the projected points w.x + b for all points x in class i, and pi is the fraction of points in class i.
- The linear discriminant that maximizes this separation is w = S^-1 (m1 - m2), i.e. the discriminant function is (m1 - m2)' S^-1 x, where mi is the mean of the x values in class i and S is the covariance matrix of the data; the threshold b = (m1 - m2)' S^-1 (m1 + m2) / 2 is the midpoint of the two class means projected onto the discriminant.
- [Figure: projections of the red and black points onto the discriminant direction; the chosen direction maximizes the separation between the projected class averages.]

Shortcomings
- The perceptron is ill-posed: several values of w might yield the same zero error on the training data.

Support vector machines
- Binary classifier: find the hyperplane providing the maximum margin between the vectors of the two classes.
- Separators with a larger margin have smaller generalization error.

Geometry of SVMs
- [Figure: for the separator w.x + b = 0, the margin is 1/||w||, the distance of the hyperplane from the origin is -b/||w||, and the distance of a point x from the hyperplane is (w.x + b)/||w||.]

Linear separators
- Most complex real-world applications require more than linear separators.
- One way around the problem is to represent the data in a transformed coordinate space in which a linear separator can be learnt.
- Example: f(m1, m2, r) = C m1 m2 / r^2 is not linear in (m1, m2, r), but it is linear in (ln m1, ln m2, ln r).

Support Vector Machines
- Good generalization performance.
- Extendable to non-separable problems (Cortes & Vapnik, 1995) and non-linear classifiers (Boser et al., 1992).
- Applications: OCR (Boser et al.), vision (Poggio et al.), text classification (Joachims).
- Requires tuning: which kernel, what parameters?
- Several freely available packages, e.g. SVMTorch.
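A minimal sketch (using scikit-learn rather than SVMTorch; the dataset and kernel choice are only illustrative) of fitting a maximum-margin classifier with a non-linear kernel, i.e. implicitly working in a transformed space where a linear separator exists:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the input space.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a linear kernel with an RBF kernel.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(kernel, round(clf.score(X_test, y_test), 3))
```

The RBF kernel usually scores noticeably higher on this data, illustrating the point about transformed coordinate spaces above.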
Locally weighted regression
- Learn a new regression equation by weighting each training instance based on its distance from the new instance.
- Regression equation: f_hat(x) = w0 + w1 a1(x) + ... + wn an(x); find the wi so as to minimize the error
  E(D) = (1/2) sum_{x in D} (f(x) - f_hat(x))^2
- Construct an explicit approximation to f over a local region surrounding the query instance xq: the target function f is approximated near xq using this linear function, minimizing the squared error with a distance-decreasing weight K:
  E(xq) = (1/2) sum_{x in k nearest neighbours of xq} (f(x) - f_hat(x))^2 K(d(xq, x))

Feature subset selection
- Embedded: built into the mining algorithm.
- Filter: select features in advance.
- Wrapper: generate candidate feature subsets and test them on the black-box mining algorithm; high cost, but provides better, algorithm-dependent features.

Meta learning methods
- No single classifier is good in all cases, and it is difficult to evaluate the conditions in advance; meta learning combines the effects of several classifiers.
- Voting: sum up the votes of the component classifiers.
- Combiners: learn a new classifier on the outcomes of the previous ones.
- Boosting: staged classifiers.
- Disadvantage: interpretation is hard. Knowledge probing: learn a single classifier to mimic the meta classifier.

Clustering or unsupervised learning: applications
- Customer segmentation, e.g. for targeted marketing: group/cluster existing customers based on the time series of their payment history so that similar customers fall in the same cluster; identify micro-markets and develop policies for each group.
- Collaborative filtering: group customers based on common items purchased.
- Image tiling.
- Text clustering, e.g. scatter/gather.
- Compression.

Distance functions
- Numeric data: Euclidean and Manhattan distances; more generally the Minkowski metric [sum_i |xi - yi|^m]^(1/m), where larger m gives higher weight to larger per-dimension differences.
- Categorical data: 0/1 to indicate presence/absence.
  - Hamming distance (number of dissimilarities): like Euclidean distance, gives equal weight to 1-1 and 0-0 matches.
  - Jaccard coefficient: number of matching 1s / total number of 1s (0-0 matches are not important).
  - Data-dependent measures: the similarity of A and B depends on their co-occurrence with C.
- Combined numeric and categorical data: weighted normalized distance.

Distance functions on high dimensional data
- Examples: time series, text, images.
- Euclidean measures make all points look equally far apart.
- Reduce the number of dimensions:
  - Choose a subset of the original features using random projections or feature selection techniques.
  - Transform the original features using statistical methods like principal component analysis.
- Define domain-specific similarity measures: e.g. for images, features such as the number of objects or a color histogram; for time series, shape-based measures.
- Define non-distance-based (model-based) clustering methods.

Clustering methods
- Hierarchical clustering: agglomerative vs. divisive; single link vs. complete link.
- Partitional clustering: distance-based (K-means), model-based (EM), density-based.

A dendrogram shows how the clusters are merged hierarchically
- Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
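A minimal sketch (SciPy; the points and the cut threshold are illustrative) of building and cutting such a dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points: a tight pair near the origin and a tight triple near (5, 5).
X = np.array([[0.0, 0.0], [0.3, 0.1],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]])

# Agglomerative clustering; 'single' and 'complete' give the two
# linkage criteria discussed below.
Z = linkage(X, method="single")

# Cut the dendrogram at distance 1.0: each connected component is a cluster.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)  # e.g. [1 1 2 2 2]
```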
- [Figure: agglomerative vs. divisive hierarchical clustering of the points a, b, c, d, e over steps 0-4 (agglomerative merges a-c and d-e, then b with d-e, then everything into abcde; divisive runs the same steps in reverse), together with single-link and complete-link dendrograms built from the 5x5 pairwise distance matrix.]

Pros and cons
- Single link: suffers from the chaining effect; confused by nearly overlapping clusters.
- Complete link: makes unnecessary splits of elongated point clouds; sensitive to outliers.
- Several other hierarchical methods are known.

Partitional methods: K-means
- Criterion: minimize the sum of squared distances, either between each point and the centroid of its cluster, or between each pair of points in the same cluster.
- Algorithm (a minimal sketch appears at the end of these notes):
  1. Select an initial partition with K clusters: random, the first K points, or K well-separated points.
  2. Repeat until stabilization: assign each point to the closest cluster center, then generate new cluster centers.
  3. Adjust clusters by merging/splitting.

Association rules
- Given a set T of groups of items, e.g. a set of baskets of items purchased such as {milk, cereal}, {tea, milk}, {tea, rice, bread}, {cereal}.
- Goal: find all rules on itemsets of the form a --> b such that
  - the support of a and b is greater than a user threshold s, and
  - the conditional probability (confidence) of b given a is greater than a user threshold c.
- Example: milk --> bread.
- A lot of work has been done on scalable algorithms.

Variants
- High confidence may not imply high correlation. Use correlations instead: find the expected support and treat large departures from it as interesting (Brin et al.); this is a limited attempt, and more complete work exists in the statistical literature on contingency tables.
- There are still too many rules, so pruning is needed.
- Association does not imply causality, as it would in Bayesian networks.

Applications of fast itemset counting
- Finding correlated events:
  - Applications in medicine: find redundant tests.
  - Cross-selling in retail and banking.
  - Improving the predictive capability of classifiers that assume attribute independence.
  - New similarity measures for categorical attributes [Mannila et al., KDD 98].

Temporal mining
- Several large data domains are inherently temporal: stock prices; monitoring data (patient monitors, manufacturing processes, performance logs); transaction data.
- Lots of prior work from signal processing, statistics and speech recognition.

Temporal mining (continued)
- Finding significant patterns along time.
- Similarity matches and clustering.
- Rules along time series, e.g. a drop in kerosene prices --> an increase in bronchitis cases.
- Classification on time series data, e.g. customers with high variance in balance are likely to default; speed fluctuations with significant third-order ARMA coefficients are probably from drunk drivers.
- Detecting drift in models along time.
- Spatial scan statistics (paper in the reading list).

Mining market
- Around 20 to 30 mining tool vendors.
- Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner; all offer pretty much the same set of tools.
- Many embedded products: fraud detection, electronic commerce applications, health care, customer relationship management (Epiphany).

Summary
- What data mining is, and an overview of the various operations:
  - Classification: regression, nearest neighbour, neural networks, Bayesian methods.
  - Clustering: distance-based (K-means), distribution-based (EM).
  - Itemset counting.
- There are several operations; the challenge is choosing the right operation for the problem.

Resources
- http://www.kdnuggets.com
- SIGKDD: http://www.acm.org/sigkdd
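Appendix: the minimal K-means sketch referenced in the partitional-methods slide (pure numpy; the data, K and the iteration cap are illustrative assumptions, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign points to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points, so no cluster starts empty.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # stabilized
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers.round(1))
```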