Handling the Data Deluge: Tools for the Era of "Big Data"
D. A. Dickey, April 2012

Data Mining - What is it?
• Collection of methods for "big data"
• Fast methods
• Topics
  – Association Analysis
  – Trees (recursive splitting)
  – Logistic Regression
  – Neural Networks
  – Nearest Neighbor
  – Clustering
  – Text Mining, etc.

Association Analysis
• Market basket analysis
  – What they're doing when they scan your "VIP" card at the grocery
  – People who buy diapers tend to also buy _________ (beer?)
  – Just a matter of accounting, but with new terminology (of course)

Does diaper purchase predict beer purchase?
• Contingency tables

  DEPENDENT (yes)                      INDEPENDENT (no predictability)
                Beer                                 Beer
                Yes    No                            Yes    No
  Diapers Yes    40    60   (100)      Diapers Yes    23    77   (100)
  Diapers No      6    94   (100)      Diapers No     23    77   (100)
  Overall        23%   77%             Overall        23%   77%

Terminology
• Baskets: ABC  ACD  BCD  ADE  BCE

  Rule X=>Y    Support Pr{X and Y}    Confidence Pr{Y|X}
  A=>D              2/5                     2/3
  C=>A              2/5                     2/4
  B&C=>D            1/5                     1/3

Don't be Fooled!
• Lift = Confidence / Expected Confidence if Independent

               Checking
               No (1500)   Yes (8500)   (10000)
  Saving No       500         3500        4000
  Saving Yes     1000         5000        6000

• Observed Confidence (savings => checking) is 5000/6000 = 83%
• Sounds great, but ... expect 8500/10000 = 85% if independent
• Lift = 83/85 < 1. Savings account holders are actually LESS likely than others to have a checking account!

Trees
• A "divisive" method (splits)
• Start with the "root node" - all observations in one group
• Get splitting rules
• Response often binary
• Result is a "tree"
• Example: Loan Defaults
• Example: Framingham Heart Study
• Example: Automobile Fatalities

Recursive Splitting
[Diagram: X1 = debt-to-income ratio by X2 = age, recursively split into rectangles with Pr{default} ranging from 0.0001 to 0.012; defaults concentrate in some regions.]

Some Actual Data
• Framingham Heart Study
• First-stage coronary heart disease
  – P{CHD} = function of:
    • Age (no drug yet!)
    • Cholesterol
    • Systolic BP

Example of a "tree"
[Tree diagram: all 1615 patients at the root; split #1 on age, then on systolic BP, ending in "terminal nodes".]

How to make splits?
• Which variable to use?
• Where to split?
  – Cholesterol > ____
  – Systolic BP > ____
• Goal: pure "leaves" or "terminal nodes"
• Ideal split: everyone with BP > x has problems, nobody with BP < x has problems

First review the Chi-Square test
• Contingency tables

  DEPENDENT (yes)                      INDEPENDENT (no)
            Heart Disease                        Heart Disease
            No     Yes                           No     Yes
  Low BP     95      5   (100)         Low BP     75     25   (100)
  High BP    55     45   (100)         High BP    75     25   (100)

Chi-Square Test Statistic
• Expect 100(150/200) = 75 in the upper left if independent (etc., e.g. 100(50/200) = 25)

            Heart Disease
            No         Yes
  Low BP    95 (75)     5 (25)    100
  High BP   55 (75)    45 (25)    100
            150         50        200

• chi-square = sum over all cells of (observed - expected)^2 / expected
  = 2(400/75) + 2(400/25) = 42.67
• Compare to tables - significant! (Significant???  And where is the high-BP cutoff?)
• Conclusion: sufficient evidence against the hypothesis of no relationship.

  H0: Innocence              H0: No association
  H1: Guilt                  H1: Association
  Beyond reasonable doubt    P < 0.05
                             Here P = 0.00000000064

Measuring the "Worth" of a Split
• P-value is the probability of a chi-square as great as that observed if independence is true. (Pr{chi-square > 42.67} is 6.4E-11.)
• P-values are all too small to compare directly.
• Logworth = -log10(p-value) = 10.19
• Best chi-square <=> maximum logworth.
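The chi-square and logworth above can be reproduced with a short PROC FREQ step. This is a minimal sketch, not code from the slides; the data set and variable names (bp_chd, bp, chd, count) are assumptions made for illustration.

  data bp_chd;                       /* cell counts from the table above */
     input bp $ chd $ count;
     datalines;
  Low  No  95
  Low  Yes  5
  High No  55
  High Yes 45
  ;
  run;

  proc freq data=bp_chd;
     weight count;                      /* each row is a cell count, not one patient */
     tables bp*chd / chisq expected;    /* prints expected counts and chi-square = 42.67 */
  run;

The p-value PROC FREQ reports (about 6.4E-11) gives logworth = -log10(p-value), roughly 10.19, matching the slide.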
Logworth for Age Splits
[Plot: logworth versus candidate age cutoffs; age 47 maximizes logworth.]

How to make splits?
• Which variable to use?
• Where to split?
  – Cholesterol > ____
  – Systolic BP > ____
• Idea: pick the BP cutoff that minimizes the p-value for chi-square
• What does "significance" mean now?

Multiple Testing
• 50 different BPs in the data, so 49 ways to split
• Sunday football highlights always look good!
• If he shoots enough times, even a 95% free-throw shooter will miss.
• Tried 49 splits; each has a 5% chance of declaring significance even if there is no relationship.

Multiple Testing
• alpha = Pr{falsely reject hypothesis 1} (miss on 1st shot)
• alpha = Pr{falsely reject hypothesis 2} (miss on 2nd shot)
• Pr{falsely reject one or the other} < 2*alpha
• Desired: 0.05 probability or less
• Solution: use alpha = 0.05/2, or equivalently compare 2*(p-value) to 0.05

Multiple Testing
• 50 different BPs in the data, m = 49 ways to split
• Multiply the p-value by 49
• Bonferroni - the original idea
• Kass - applied it to data mining (trees)
• Stop splitting if the minimum p-value is large.
• For m splits, logworth becomes -log10(m * p-value)

Goals
• Split if diversity in the parent "node" > summed diversities in the child nodes
• Observations should be
  – Homogeneous (not diverse) within leaves
  – Different between leaves
  – Leaves should be diverse (from one another)

Validation
• Traditional statistics: small data set, need all observations to estimate the parameters of interest.
• Data mining: loads of data, can afford a "holdout sample".
• Variation: n-fold cross validation
  – Randomly divide the data into n sets
  – Estimate on n-1 of them, validate on the remaining 1
  – Repeat n times, using each set as the holdout.

Pruning
• Grow a bushy tree on the "fit data".
• Classify the holdout data.
• The farthest-out branches likely do not improve, and may even hurt, the fit on the holdout data.
• Prune non-helpful branches.
• What is "helpful"? What is a good discriminator criterion?

Goals
• Want diversity in the parent "node" > summed diversities in the child nodes
• Goal is to reduce diversity within leaves
• Goal is to maximize differences between leaves
• Use validation average squared error, proportion of correct decisions, etc.
• Costs (profits) may enter the picture for splitting or pruning.

Accounting for Costs
• "Pardon me (sir, ma'am), can you spare some change?"
• Say "sir" to a male:       +$2.00
• Say "ma'am" to a female:   +$5.00
• Say "sir" to a female:     -$1.00 (balm for slapped face)
• Say "ma'am" to a male:    -$10.00 (nose splint)

Including Probabilities
• Leaf has Pr(M) = 0.7, Pr(F) = 0.3.

                       You say:
  True gender       "Sir"         "Ma'am"
  M (0.7)           0.7(2)        0.7(-10)
  F (0.3)           0.3(-1)       0.3(5)
  Expected profit   +$1.10        -$5.50

• Expected profit is 2(0.7) - 1(0.3) = $1.10 if I say "sir".
• Expected profit is -7 + 1.5 = -$5.50 (a loss) if I say "ma'am".
• Weight leaf profits by leaf size (# of observations) and sum.
• Prune (and split) to maximize profits.
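A data step can verify the expected-profit arithmetic for this leaf. This is a minimal sketch; the payoff values come from the slide, but the data set and variable names (leaf_profit, pm, pf, etc.) are made up for illustration.

  data leaf_profit;
     pm = 0.7;  pf = 0.3;                  /* leaf probabilities Pr(M), Pr(F)            */
     profit_sir  = pm*2     + pf*(-1);     /* expected profit of saying "sir"   =  1.10  */
     profit_maam = pm*(-10) + pf*5;        /* expected profit of saying "ma'am" = -5.50  */
     decision = ifc(profit_sir > profit_maam, 'Sir', "Ma'am");   /* pick the better action */
  run;

  proc print data=leaf_profit noobs; run;

Summing these leaf-level expected profits, weighted by leaf sizes, gives the overall profit used to guide splitting and pruning.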
Additional Ideas
• Forests: draw samples with replacement (bootstrap) and grow multiple trees.
• Random Forests: also randomly sample the "features" (predictors) and build multiple trees.
• Classify a new point in each tree, then average the probabilities or take a plurality vote across the trees.

Cumulative Lift Chart
• Order the leaves from most to least predicted response (predicted response high to low).
• Lift = (proportion responding in the first p%) / (overall population response rate).
[Chart: cumulative lift, about 3.3 in the most-predicted leaves.]

Regression Trees
• Continuous response Y
• Predicted response P_i is constant in regions i = 1, ..., 5
[Diagram: the (X1, X2) plane split into rectangles with predictions 20, 50, 80, 100, 130.]
• Predict P_i in cell i; Y_ij is the jth response in cell i.
• Split to minimize the sum of squared prediction errors  sum_i sum_j (Y_ij - P_i)^2

Real data example: traffic accidents in Portugal*
• Y = injury-induced "cost to society"
[Tree diagram: "Help - I ran into a 'tree'"]
* Tree developed by Guilhermina Torrao (used with permission), NCSU Institute for Transportation Research & Education

Another tool - Regression
"If the Life Line is long and deep, then this represents a long life full of vitality and health. A short line, if strong and deep, also shows great vitality in your life and the ability to overcome health problems. However, if the line is short and shallow, then your life may have the tendency to be controlled by others."
http://www.ofesite.com/spirit/palm/lines/linelife.htm

Wilson & Mather, JAMA 229 (1974)
• X = life line length, Y = age at death
• Result: Predicted Age at Death = 79.24 - 1.367(lifeline)
• (Is this "real"? Is this repeatable?)
• We use LEAST SQUARES; the squared residuals sum to 9609.

proc reg data=life;
   model age=line;
run;

Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1         79.23341           14.83229        5.34     <.0001
  Line         1         -1.36697            1.59782       -0.86     0.3965

[Plot: t distribution for t = -0.86, with tail area 0.19825 on each side, total 0.39650.]

Logistic Regression
• "Trees" seem to be the main tool; logistic regression is another classifier.
• Older, "tried & true" method.
• Predict the probability of response from input variables ("features").
• Linear regression gives an infinite range of predictions.
• 0 < probability < 1, so not linear regression.

Example: Seat Fabric Ignition
• Flame exposure time = X
• Ignited Y = 1, did not ignite Y = 0
  – Y = 0: X = 3, 5, 9, 10, 13, 16
  – Y = 1: X = 11, 12, 14, 15, 17, 25, 30
• Likelihood Q = (1-p1)(1-p2)(1-p3)(1-p4) p5 p6 (1-p7) p8 p9 (1-p10) p11 p12 p13
• The p's are all different: p_i = f(a + b*X_i)
• Find a, b to maximize Q(a, b)

• Logistic idea: map p in (0,1) to L on the whole real line
• Use L = ln(p/(1-p))
• Model L as linear in the predictor, e.g. Predicted L = a + b(temperature)
• Given temperature X, compute L(X) = a + bX, then p = e^L / (1 + e^L)
• p(i) = e^(a+bX_i) / (1 + e^(a+bX_i))
• Write p(i) if response, 1 - p(i) if not
• Multiply all n of these together and find a, b to maximize the product

Generate Q for an array of (a, b) values:

DATA LIKELIHOOD;
   ARRAY Y(14) Y1-Y14;
   ARRAY X(14) X1-X14;
   DO I=1 TO 14;
      INPUT X(I) Y(I) @@;
   END;
   DO A = -3 TO -2 BY .025;
      DO B = 0.2 TO 0.3 BY .0025;
         Q=1;
         DO I=1 TO 14;
            L=A+B*X(I);
            P=EXP(L)/(1+EXP(L));
            IF Y(I)=1 THEN Q=Q*P;
            ELSE Q=Q*(1-P);
         END;
         IF Q<0.0006 THEN Q=0.0006;
         OUTPUT;
      END;
   END;
CARDS;
3 0  5 0  7 1  9 0  10 0  11 1  12 1  13 0  14 1  15 1  16 0  17 1  25 1  30 1
;

[Plot: likelihood function Q over the (a, b) grid, maximized near a = -2.6, b = 0.23.]

[Slide: illustration of a concordant pair and a discordant pair. A pair (one event, one non-event) is concordant if the event has the higher predicted probability, discordant otherwise.]

IGNITION DATA
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
  Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept    1    -2.5879          1.8469            1.9633        0.1612
  TIME         1     0.2346          0.1502            2.4388        0.1184

Association of Predicted Probabilities and Observed Responses
  Percent Concordant   79.2      Somers' D   0.583
  Percent Discordant   20.8      Gamma       0.583
  Percent Tied          0.0      Tau-a       0.308
  Pairs                48        c           0.792
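For comparison with the grid search above, the same fit can be obtained directly from PROC LOGISTIC. A minimal sketch, assuming the 14 ignition observations sit in a data set named ignition with variables time and y (the names are assumptions, not from the slides):

  data ignition;
     input time y @@;               /* flame exposure time, ignition indicator */
     datalines;
  3 0  5 0  7 1  9 0  10 0  11 1  12 1
  13 0  14 1  15 1  16 0  17 1  25 1  30 1
  ;
  run;

  proc logistic data=ignition;
     model y(event='1') = time;     /* model Pr(y=1): positive slope means longer */
  run;                              /* exposure -> higher ignition probability    */

This should reproduce the maximum likelihood estimates shown above (intercept about -2.59, slope about 0.23).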
Example: Shuttle Missions
• O-rings failed in the Challenger disaster
• Low temperature
• Prior flights had "erosion" and "blowby" in the O-rings
• Feature: temperature at liftoff
• Target: problem (1) - erosion or blowby - vs. no problem (0)

Example: Framingham
• X = age
• Y = 1 if heart trouble, 0 otherwise

Framingham
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
  Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept    1    -5.4639          0.5563           96.4711       <.0001
  age          1     0.0630          0.0110           32.6152       <.0001

Neural Networks
• Very flexible functions
• "Hidden layers"
• "Multilayer perceptron"
[Diagram: inputs -> hidden layer -> output; a logistic function of logistic functions of the data.]
• Arrows represent linear combinations of "basis functions," e.g. logistic curves (hyperbolic tangents).
• Example: Y = a + b1*p1 + b2*p2 + b3*p3, e.g. Y = 4 + p1 + 2*p2 - 4*p3

Unsupervised Learning
• We have the "features" (predictors).
• We do NOT have the response, even on a training data set (UNsupervised).
• Clustering
  – Agglomerative: start with each point separate
  – Divisive: start with all points in one cluster, then split
  – Direct: state the number of clusters beforehand (e.g. the EM algorithm)

PROC FASTCLUS
• Step 1 - find (50) "seeds" as separated as possible.
• Step 2 - cluster points to the nearest seed.
  – Drift: as points are added, move the seed (centroid) to the average of each coordinate.
  – Alternatively: make a full pass, then recompute the seeds and iterate.
• Step 3 - aggregate the clusters using Ward's method.

[Plots: "Clusters as Created" versus "As Clustered - PROC FASTCLUS".]
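The three-step procedure above can be sketched in SAS roughly as follows. This is a minimal sketch under assumptions, not the code behind the slides; the data set name points and the coordinate variables x and y are made up.

  /* Steps 1-2: find up to 50 well-separated seeds and assign each point to its nearest seed */
  proc fastclus data=points maxclusters=50 mean=seeds out=assigned;
     var x y;
  run;

  /* Step 3: aggregate the preliminary clusters with Ward's method,                  */
  /* weighting each cluster mean by the number of points it represents (_FREQ_)      */
  proc cluster data=seeds method=ward outtree=tree;
     var x y;
     freq _freq_;
  run;

PROC TREE can then cut the Ward dendrogram at the desired number of clusters.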
TEXT MINING
Hypothetical collection of news releases ("corpus"):
  Release 1: "Did the NCAA investigate the basketball scores and vote for sanctions?"
  Release 2: "Republicans voted for and Democrats voted against it for the win."
  (etc.)

Compute word counts:
             NCAA   basketball   score   vote   Republican   Democrat   win
  Release 1    1         1         1       1         0           0       0
  Release 2    0         0         0       2         1           1       1

Text Mining Mini-Example
• Word counts in 16 e-mails.
[Table: one row per document, with counts of the 13 words Election, President, Republican, Basketball, Democrat, Voters, NCAA, Liar, Tournament, Speech, Wins, Score_V, Score_N.]

Eigenvalues of the Correlation Matrix
        Eigenvalue    Difference    Proportion   Cumulative
   1    7.10954264    4.80499109      0.5469       0.5469
   2    2.30455155    1.30162837      0.1773       0.7242
   3    1.00292318    0.23404351      0.0771       0.8013
   4    0.76887967    0.21070080      0.0591       0.8605
   5    0.55817886    0.10084923      0.0429       0.9034
   6    0.45732963    0.15563511      0.0352       0.9386
   7    0.30169451    0.13396581      0.0232       0.9618
   8    0.16772870    0.00501411      0.0129       0.9747
   9    0.16271459    0.04345658      0.0125       0.9872
  10    0.1192580     0.08890707      0.0092       0.9964
  11    0.0303509     0.01437903      0.0023       0.9987
  12    0.0159719     0.01509610      0.0012       0.9999
  13    0.0008758                     0.0001       1.0000

[Diagram: Prin1 and Prin2 axes; the Prin1 coordinate is a contrast such as .707(word1) - .707(word2).]
• 55% of the variation in these 13-dimensional vectors occurs in one dimension.

Principal component loadings:
  Variable       Prin1
  Basketball    -0.320074
  NCAA          -0.314093
  Tournament    -0.277484
  Score_V       -0.134625
  Score_N       -0.120083
  Wins          -0.080110
  Speech         0.273525
  Voters         0.294129
  Liar           0.309145
  Election       0.315647
  Republican     0.318973
  President      0.333439
  Democrat       0.336873

• Sports words load negatively on Prin1 and politics words load positively, so Prin1 separates sports documents from politics documents.

Documents sorted by Prin1 score (word-count columns omitted):
  Document   Cluster     Prin1
      3         1      -3.63815
     11         1      -3.02803
      5         1      -2.98347
     14         1      -2.48381
      7         1      -2.37638
      8         1      -1.79370
                                   <- (biggest gap)
      1         2      -0.00738
      2         2       0.48514
      6         2       1.54559
      4         2       1.59833
     10         2       2.49069
     13         2       3.16620
     12         2       3.48420
      9         2       3.54077

• PROC CLUSTER (single linkage) agrees!
[Plot of Prin1*Prin2 with document labels: Cluster 1 documents at low Prin1, Cluster 2 documents at high Prin1.]
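The principal components step behind these results can be run with PROC PRINCOMP. A minimal sketch, assuming the word counts sit in a data set named emails with one row per document and a document identifier named document (the data set and identifier names are assumptions; the variable names are the 13 words from the slide):

  proc princomp data=emails out=scores n=2;
     /* the correlation matrix is the default, matching the eigenvalue table above */
     var Election President Republican Basketball Democrat Voters NCAA
         Liar Tournament Speech Wins Score_V Score_N;
  run;

  /* single-linkage clustering; the slides do not say which inputs were used, */
  /* so the two component scores are shown here as one reasonable choice      */
  proc cluster data=scores method=single outtree=tree;
     var Prin1 Prin2;
     id document;
  run;

The OUT= data set from PROC PRINCOMP contains the Prin1 and Prin2 scores plotted on the slide.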
• Can use two, three, or more components (dimensions).

Neural Networks (continued) - fitting notes
• Should always use a holdout sample.
• Perturb coefficients to optimize fit on the fit data - nonlinear search algorithms.
• Eliminate unnecessary complexity using the holdout data.
• Other basis sets - Radial Basis Functions: just normal densities (bell-shaped) with adjustable means and variances.

Statistics to Data Mining Dictionary
  Statistics (nerdy)                            Data Mining (cool)
  Independent variables                         Features
  Dependent variable                            Target
  Estimation                                    Training, Supervised Learning
  Clustering                                    Unsupervised Learning
  Prediction                                    Scoring
  Slopes, Betas                                 Weights (neural nets)
  Intercept                                     Bias (neural nets)
  Composition of Hyperbolic Tangent Functions   Neural Network
  Radial Basis Function                         Normal Density
  ... and my personal favorite:
  Type I and Type II Errors                     Confusion Matrix

Summary
• Data mining: a set of fast statistical methods for large data sets.
• Some new ideas, many old ideas or extensions of old ones.
• Some methods:
  – Trees (recursive splitting)
  – Logistic Regression
  – Neural Networks
  – Association Analysis
  – Nearest Neighbor
  – Clustering
  – Etc.

D.A.D.

Cubic Clustering Criterion (to decide the number of clusters)
• Divide a random scatter of (X, Y) points into 4 quadrants.
• Pooled within-cluster variation is much less than the overall variation.
• Large variance reduction, so a big R-square despite there being no real clusters.
• CCC compares the R-square you got to the R-square expected from random scatter to decide the number of clusters.
• CCC picks 3 clusters for the "macaroni" data.
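One way to obtain the CCC in SAS is the CCC option of PROC CLUSTER. A minimal sketch, assuming the macaroni-shaped points sit in a data set named macaroni with coordinates x and y (all names assumed for illustration):

  /* Ward clustering with the cubic clustering criterion reported in the cluster history */
  proc cluster data=macaroni method=ward ccc outtree=tree;
     var x y;
  run;

  /* cut the dendrogram at the number of clusters the CCC suggests (3 here) */
  proc tree data=tree nclusters=3 out=clusout noprint;
  run;

The CCC column in the cluster history peaks at the suggested number of clusters, which for the macaroni data is the 3-cluster solution described above.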