Intelligent Data Mining
Ethem Alpaydın
Department of Computer Engineering
Boğaziçi University
[email protected]

What is Data Mining?
• Search for very strong patterns (correlations, dependencies) in big data that can generalise to accurate future decisions.
• Also known as knowledge discovery in databases, or business intelligence.

Example Applications
• Association: “30% of customers who buy diapers also buy beer.” (basket analysis)
• Classification: “Young women buy small inexpensive cars.” “Older wealthy men buy big cars.”
• Regression: credit scoring

Example Applications (cont’d)
• Sequential patterns: “Customers who pay late on two or more of the first three installments have a 60% probability of defaulting.”
• Similar time sequences: “The value of the stocks of company X has been similar to that of company Y’s.”

Example Applications (cont’d)
• Exceptions (deviation detection): “Are any of my customers behaving differently than usual?”
• Text mining (web mining): “Which documents on the internet are similar to this document?”

IDIS - US Forest Service
• Identifies forest stands (areas similar in age, structure and species composition).
• Predicts how different stands would react to fire and what preventive measures should be taken.

GTE Labs
• KEFIR (Key Findings Reporter) evaluates health-care utilization costs.
• Isolates groups whose costs are likely to increase in the next year.
• Finds medical conditions for which there is a known procedure that improves health and decreases costs.

Lockheed
• RECON: stock portfolio selection.
• Creates a portfolio of 150-200 securities from an analysis of a database of the performance of 1,500 securities over a 7-year period.

VISA
• Credit card fraud detection.
• CRIS: neural network software which learns to recognize the spending patterns of card holders and scores transactions by risk.
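As an illustration of this kind of risk scoring, here is a toy sketch (not the actual CRIS method; the spending history and thresholds are invented): a transaction is scored by how many standard deviations its amount lies from the card holder's usual spending.

```python
import statistics

def risk_score(history, amount):
    """Toy risk score: how many standard deviations the new
    transaction amount lies from the card holder's past spending."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(amount - mu) / sigma

# A card holder who normally spends 20-60 dollars on gas and groceries:
history = [25, 40, 30, 55, 35, 45, 20, 60]
print(risk_score(history, 38))    # routine purchase: low score
print(risk_score(history, 1200))  # sudden large purchase: high score
```

A real system would of course model far richer features (merchant type, location, time), but the principle is the same: score each transaction against the learned spending pattern and alert on outliers.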
“If a card holder normally buys gas and groceries and the account suddenly shows a purchase of stereo equipment in Hong Kong, CRIS sends a notice to the bank, which in turn can contact the card holder.”

ISL Ltd (Clementine) - BBC
• Audience prediction: program schedulers must be able to predict the likely audience for a program and the optimum time to show it.
• Type of program, time, competing programs and other events affect audience figures.

Data Mining is NOT Magic!
Data mining draws on the concepts and methods of databases, statistics, and machine learning.

From the Warehouse to the Mine
• Transactional databases → (extract, transform, cleanse data) → data warehouse → (define goals, data transformations) → standard form.

How to Mine?
• Verification: computer-assisted, user-directed, top-down; tools: query and report, OLAP (Online Analytical Processing).
• Discovery: automated, data-driven, bottom-up.

Steps: 1. Define Goal
• Associations between products?
• New market segments or potential customers?
• Buying patterns over time or product sales trends?
• Discriminating among classes of customers?

Steps: 2. Prepare Data
• Integrate, select and preprocess existing data (already done if there is a warehouse).
• Add any other data relevant to the objective which might supplement existing data.

Steps: 2. Prepare Data (cont’d)
• Select the data: identify relevant variables.
• Data cleaning: errors, inconsistencies, duplicates, missing data.
• Data scrubbing: mappings, data conversions, new attributes.
• Visual inspection: data distribution, structure, outliers, correlations between attributes.
• Feature analysis: clustering, discretization.

Steps: 3. Select Tool
• Identify the task class: clustering/segmentation, association, classification, pattern detection/prediction in time series.
• Identify the solution class: explanation (decision trees, rules) vs. black box (neural networks).
• Model assessment, validation and comparison: k-fold cross-validation, statistical tests.
• Combination of models.

Steps: 4.
Interpretation
• Are the results (explanations/predictions) correct and significant?
• Consultation with a domain expert.

Example
• Data as a table of attributes:

Name  Income    Owns a house?  Marital status  Default
Ali   25,000 $  Yes            Married        No
Veli  18,000 $  No             Married        Yes

• We would like to be able to explain the value of one attribute in terms of the values of the other attributes that are relevant.

Modelling Data
• Attributes x are observable.
• y = f(x), where f is unknown and probabilistic.

Building a Model for Data: Learning from Data
• Given a sample X = {x^t, y^t}_t, we build f*(x^t), a predictor of f(x^t), that minimizes the difference between our prediction and the actual value:
  E = Σ_t [y^t − f*(x^t)]²

Types of Applications
• Classification: y in {C1, C2, …, CK}
• Regression: y in ℝ
• Time-series prediction: x temporally dependent
• Clustering: group x according to similarity

Example
(figure: customers plotted by yearly income (x1) and savings (x2), labelled OK or DEFAULT)

Example Solution
(figure: thresholds q1 on yearly income and q2 on savings separate the two groups)
RULE: IF yearly-income > q1 AND savings > q2 THEN OK ELSE DEFAULT

Decision Trees
(figure: tree with root test x1 > q1: if no, y = 0; if yes, test x2 > q2: if no, y = 0; if yes, y = 1)
x1: yearly income, x2: savings; y = 0: DEFAULT, y = 1: OK

Clustering
(figure: three customer groups, Type 1, Type 2 and Type 3, in the savings vs. yearly-income plane, with OK and DEFAULT labels)

Time-Series Prediction
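The two-threshold credit rule and its equivalent decision tree from the example above can be sketched in code (the threshold values q1 and q2 here are hypothetical, chosen only for illustration):

```python
Q1 = 20_000  # q1: yearly-income threshold (hypothetical)
Q2 = 5_000   # q2: savings threshold (hypothetical)

def credit_decision(yearly_income, savings):
    """IF yearly-income > q1 AND savings > q2 THEN OK ELSE DEFAULT."""
    if yearly_income > Q1:      # root node: x1 > q1 ?
        if savings > Q2:        # inner node: x2 > q2 ?
            return "OK"         # y = 1
    return "DEFAULT"            # y = 0

print(credit_decision(25_000, 8_000))  # -> OK
print(credit_decision(18_000, 9_000))  # -> DEFAULT
```

In practice the thresholds are not hand-picked but learned from the data, by choosing the splits that best separate the OK and DEFAULT customers.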
(figure: monthly values from Jan to Dec on a time axis, with the next Jan to be predicted; past, present, future; discovery of frequent episodes)

Methodology
• Initial standard form → data reduction (value and feature reductions) → train set and test set.
• Train alternative predictors (Predictor 1, …, Predictor L) on the train set.
• Test the trained predictors on the test set and choose the best; accept the best if it is good enough.

Data Visualisation
• Plot data in fewer dimensions (typically 2) to allow visual analysis.
• Visualisation of structure, groups and outliers.
(figure: savings vs. yearly income, showing a rule region and its exceptions)

Techniques for Training Predictors
• Parametric multivariate statistics
• Memory-based (case-based) models
• Decision trees
• Artificial neural networks

Classification
• x: d-dimensional vector of attributes
• C1, C2, …, CK: K classes
• Reject or doubt option
• Compute P(Ci|x) from data and choose k such that P(Ck|x) = max_j P(Cj|x)

Bayes’ Rule
P(Cj|x) = p(x|Cj) P(Cj) / p(x), where
• p(x|Cj): likelihood that an object of class j has features x
• P(Cj): prior probability of class j
• p(x): probability of an object (of any class) with features x
• P(Cj|x): posterior probability that an object with features x is of class j

Statistical Methods
• Parametric, e.g., a Gaussian model for the class densities p(x|Cj):
  Univariate: p(x|Cj) = (1 / (√(2π) σ_j)) exp(−(x − μ_j)² / (2σ_j²))
  Multivariate: p(x|Cj) = (1 / ((2π)^(d/2) |Σ_j|^(1/2))) exp(−½ (x − μ_j)ᵀ Σ_j⁻¹ (x − μ_j))

Training a Classifier
• Given data {x^t}_t of class Cj:
  Univariate: p(x|Cj) is N(μ_j, σ_j²), estimated by
    μ̂_j = Σ_{x^t ∈ Cj} x^t / n_j,  σ̂_j² = Σ_{x^t ∈ Cj} (x^t − μ̂_j)² / (n_j − 1),  P̂(Cj) = n_j / n
  Multivariate: p(x|Cj) is N_d(μ_j, Σ_j), estimated by
    μ̂_j = Σ_{x^t ∈ Cj} x^t / n_j,  Σ̂_j = Σ_{x^t ∈ Cj} (x^t − μ̂_j)(x^t − μ̂_j)ᵀ / (n_j − 1)

(figures: 1D case; different variances; many classes; 2D case with equal spherical classes; shared covariances; different covariances)

Actions and Risks
• a_i: action i
• λ(a_i|Cj): loss of taking action a_i when the situation is Cj
• R(a_i|x) = Σ_j λ(a_i|Cj) P(Cj|x)
• Choose a_k such that R(a_k|x) = min_i R(a_i|x)

Function Approximation (Scoring): Regression
• y^t = f(x^t | θ) + ε, where ε is noise.
• In linear regression, find w, w0 such that f(x^t | w, w0) = w x^t + w0 minimizes
  E(w, w0) = Σ_t (y^t − w x^t − w0)²
• Solve ∂E/∂w = 0 and ∂E/∂w0 = 0.
(figure: linear regression fit)

Polynomial Regression
• E.g., quadratic: f(x^t | w2, w1, w0) = w2 (x^t)² + w1 x^t + w0
  E(w2, w1, w0) = Σ_t (y^t − w2 (x^t)² − w1 x^t − w0)²
(figure: polynomial regression fit)

Multiple Linear Regression
• d inputs: f(x1^t, x2^t, …, xd^t | w0, w1, …, wd) = w1 x1^t + w2 x2^t + … + wd xd^t + w0 = wᵀx
  E(w0, w1, …, wd) = Σ_t [y^t − f(x1^t, x2^t, …, xd^t | w0, w1, …, wd)]²

Feature Selection
• Subset selection: forward and backward methods.
• Linear projection: Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA).

Sequential Feature Selection
• Forward selection: (x1) (x2) (x3) (x4) → (x1 x3) (x2 x3) (x3 x4) → (x1 x2 x3) (x2 x3 x4)
• Backward selection: (x1 x2 x3 x4) → (x1 x2 x3) (x1 x2 x4) (x1 x3 x4) (x2 x3 x4) → (x2 x4) (x1 x4) (x1 x2)

Principal Components Analysis (PCA)
(figures: PCA rotates the x1-x2 axes to the principal axes z1-z2; whitening transform)

Linear Discriminant Analysis (LDA)
(figure: LDA projects the x1-x2 data onto the discriminant direction z1)

Memory-based Methods
• Case-based reasoning
• Nearest-neighbor algorithms
• Keep a list of known instances and interpolate the response from those.
(figures: nearest neighbor; local regression; mixture of experts)

Missing Data
• Ignore cases with missing data
• Mean imputation
• Imputation by regression

Training Decision Trees
(figure: tree with root test x1 > q1 and inner test x2 > q2, and the axis-aligned partition it induces in the x1-x2 plane)

Measuring Disorder
(figure: two candidate splits q of the same data, one with class counts 7/0 and 1/9 on the two sides, the other with 8/5 and 0/4)
Entropy: e = −(n_left / n) log(n_left / n) − (n_right / n) log(n_right / n)

Artificial Neural Networks
(figure: inputs x0 = +1, x1, …, xd with weights w0, w1, …, wd feeding an output unit g)
• y = g(x1 w1 + x2 w2 + … + w0) = g(wᵀx)
• Regression: g is the identity. Classification: g is the sigmoid (0/1).

Training a Neural Network
• d inputs: o = g(wᵀx) = g(Σ_{i=0}^d w_i x_i)
• Training set X = {x^t, y^t}; find w that minimizes E on X:
  E(w | X) = Σ_t (y^t − o^t)² = Σ_t [y^t − g(Σ_i w_i x_i^t)]²

Nonlinear Optimization
• Gradient descent: Δw_i = −η ∂E/∂w_i, where η is the learning factor.
• Iterative learning, starting from random w.

Neural Networks for Classification
• K outputs o_j, j = 1, …, K.
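The single-output training loop described above can be sketched as a minimal batch gradient-descent example (the AND data set, learning rate and epoch count are hypothetical choices for illustration, not from the lecture):

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train(X, y, eta=1.0, epochs=5000):
    """Batch gradient descent for a single sigmoid unit:
    w_i <- w_i + eta * sum_t (y^t - o^t) * o^t * (1 - o^t) * x_i^t,
    where x^t is augmented with x_0 = +1 for the bias w_0."""
    random.seed(0)
    d = len(X[0])
    w = [random.uniform(-0.1, 0.1) for _ in range(d + 1)]
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xt, yt in zip(X, y):
            xa = [1.0] + list(xt)  # x0 = +1
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, xa)))
            for i in range(d + 1):
                grad[i] += (yt - o) * o * (1 - o) * xa[i]
        w = [wi + eta * gi for wi, gi in zip(w, grad)]
    return w

# Learn AND of two inputs (a linearly separable problem):
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w = train(X, y)
preds = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) for x1, x2 in X]
```

After training, the output for (1, 1) is above 0.5 and the outputs for the other three inputs are below 0.5, i.e. the unit has found a linear discriminant for AND.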
Each o_j estimates P(Cj|x):
  o_j = sigmoid(w_jᵀx) = 1 / (1 + exp(−w_jᵀx))

Multiple Outputs
(figure: single-layer network from inputs x0 = +1, x1, …, xd to outputs o1, …, oK, with weights up to wKd)
  o_j^t = g(w_jᵀx^t) = g(Σ_{i=0}^d w_ji x_i^t)

Iterative Training
• X = {x^t, y^t};  E(w | X) = Σ_t Σ_j (y_j^t − o_j^t)²,  with o_j^t = g(w_jᵀx^t)
• Δw_ji = −η ∂E/∂w_ji = −η Σ_t (∂E/∂o_j)(∂o_j/∂w_ji) = η Σ_t (y_j^t − o_j^t) g′(·) x_i^t
• Nonlinear (sigmoid g): Δw_ji = η Σ_t (y_j^t − o_j^t) o_j^t (1 − o_j^t) x_i^t
• Linear (identity g): Δw_ji = η Σ_t (y_j^t − o_j^t) x_i^t

Nonlinear Classification
(figures: a linearly separable case and a NOT linearly separable case; the latter requires a nonlinear discriminant)

Multi-Layer Networks
(figure: inputs x0 = +1, x1, …, xd; hidden units h0 = +1, h1, …, hH; outputs o1, …, oK; first-layer weights up to wKd and second-layer weights up to tKH)
  o_j^t = g(Σ_{p=0}^H t_jp h_p^t)
  h_p^t = sigmoid(Σ_{i=0}^d w_pi x_i^t)

Probabilistic Networks
• Example: p(a) = 0.1, p(… | a) = 0.05, p(… | a) = 0.1, …

Evaluating Learners
1. Given a model M, how can we assess its performance on real (future) data?
2. Given M1, M2, …, ML, which one is the best?

Cross-validation
(figure: the data is split into k parts; in each of the k repetitions, one part is held out for validation and the remaining k−1 parts are used for training)
• Repeat k times and average.

Combining Learners: Why?
• Initial standard form → train set → Predictor 1, Predictor 2, …, Predictor L → validation set → choose best → best predictor.

Combining Learners: How?
• Initial standard form → train set → Predictor 1, Predictor 2, …, Predictor L → validation set → combine by voting.

Conclusions: The Importance of Data
• Extract valuable information from large amounts of raw data.
• A large amount of reliable data is a must. The quality of the solution depends highly on the quality of the data.
• Data mining is not alchemy; we cannot turn stone into gold.

Conclusions: The Importance of the Domain Expert
• A joint effort of human experts and computers.
• Any information (symmetries, constraints, etc.) regarding the application should be used to help the learning system.
• Results should be checked for consistency by domain experts.

Conclusions: The Importance of Being Patient
• Data mining is not straightforward; repeated trials are needed before the system is fine-tuned.
• Mining may be lengthy and costly. Large expectations lead to large disappointments!
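The k-fold cross-validation scheme described above can be sketched as follows (the toy "model" and scoring function here are invented stand-ins: the model is just the training-set mean, scored by negative squared error on the held-out fold):

```python
def k_fold_cross_validation(data, k, train_fn, eval_fn):
    """Split data into k parts; each part is held out once for
    validation while the remaining k-1 parts are used for training.
    Returns the average validation score over the k repetitions."""
    folds = [data[i::k] for i in range(k)]  # simple interleaved split
    scores = []
    for i in range(k):
        valid = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        scores.append(eval_fn(model, valid))
    return sum(scores) / k

# Toy usage: "training" computes the mean, "evaluation" is the
# negative mean squared error on the validation fold.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_fn = lambda d: sum(d) / len(d)
eval_fn = lambda m, d: -sum((x - m) ** 2 for x in d) / len(d)
avg_score = k_fold_cross_validation(data, 3, train_fn, eval_fn)
```

The same scheme compares candidate models M1, …, ML fairly: each is scored on data it was not trained on, and the model with the best average validation score is chosen.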
Once again: Important Requirements for Mining
• A large amount of high-quality data.
• Devoted and knowledgeable experts on:
  1. The application domain
  2. Databases (data warehouse)
  3. Statistics and machine learning
• Time and patience.

That’s all folks!