CSE 8392 SPRING 1999
DATA MINING: CORE TOPICS
Overview / Statistical Foundation

Professor Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University, Dallas, Texas 75275
(214) 768-3087  fax: (214) 768-3085
email: [email protected]
www: http://www.seas.smu.edu/~mhd
January 1999

DATA MINING APPLICATION: MARKETING (slide 36)
• Building, Using, and Managing the Data Warehouse, edited by Ramon Barquin and Herb Edelstein, Prentice Hall PTR, 1997, Ch. 8
• Fig 8-1, p 139; example, p 138
• Modeling goal
  – Pick all (and only) buyers (precision/recall)
  – Lift: ratio of the percentage of responders found by the model to the percentage expected without it
  – Cumulative lift: Fig 8-6, p 146 (deciles sorted by likelihood of buying, as predicted by the model)

DEVELOPING A MODEL (slide 37)
• DM requires fitting a model (pattern) to the data
• Solving Data Mining Problems Through Pattern Recognition, by Ruby L. Kennedy, Yuchun Lee, Benjamin Van Roy, Christopher D. Reed, and Richard D. Lippmann, Prentice Hall PTR, 1997, Section 1.7 (pp 1-10 to 1-20)
• Fig 1-7, p 1-19
• May preprocess data to reduce overhead; however, be careful to avoid introducing a bias

PREDICTIVE DATA MINING (slide 38)
• Predictive Data Mining, by Sholom M. Weiss and Nitin Indurkhya, Morgan Kaufmann, 1998
• Table 1.1, p 8
• Data reduction

DM Human Participation (slide 39)
• Determine how to transform/reduce data
• Identify important features to model
• Correctly interpret results

FIXED MODELS (slide 40)
• A fixed formula describes how output is derived from input; the application is fully understood, so there is no need to look at the data
• Ex: fixed threshold for loans

PARAMETRIC MODELS (slide 41)
• Good idea of how the model should be described, but not exact.
• Needs some data
• "Explicit mathematical equations characterize the structure of the relationship between inputs and outputs, but a few parameters are unspecified." (p 1-14, Kennedy)
• Training sets: pick the parameters by looking at the data
• Ex: linear regression, \hat{y}_i = \sum_{j=1}^{n} c_j x_{ij}
• Fig 8-2, p 140 (Barquin)

NONPARAMETRIC MODELS (slide 42)
• Rely on examination of the data to understand the model (a.k.a. data-driven models)
• Require large amounts of data
• Premise: observations found in the current data will hold true in the future
• May preprocess data based on knowledge you already have
• Methods rather than models
• Ex: nearest neighbor, neural nets, decision trees
• Fig 8-3, p 142 (Barquin): reflects the training set more accurately than a linear model

ERROR? (slide 43)
• Bias: "Difference in error between the best solution and the proposed solution" (p 47, Weiss)
• Variance: "Expected difference in error between a solution found for a single sample and the average solution obtained over many random samples" (p 48, Weiss)
• Causes
  – Data warehouse/reduction/transformation
  – Survey data/sampling
  – Medical studies with volunteers
  – Erroneous assumptions about the data
  – Problem simplification (also introduces errors)

STATISTICAL PERSPECTIVES ON DATA MINING (Elder and Pregibon) (slide 44)
• Development of statistical methods
  – Common theme: increases in memory and processor capabilities influence statistical methods
• 1960s (robust/resistant methods)
  – Estimators are sensitive to contamination
  – Develop new estimators
  – Identify and study causes of errors
  – Reflects realism: data does not obey mathematics
  – Robustness removes the limitations of narrow models
  – No work on how to use these new/improved estimators
• Early 1970s (slide 45)
  – EDA: Exploratory Data Analysis
  – Insights and modeling are data driven
  – Look at the data first??? (p 85, Fayyad)
  – Data = Fit + Residual (Fayyad formula 4.1.1, p 85)
    • Structure and noise; iterative
  – Graphical methods assist in visualization
  – Data transformation: reexpression/splitting (p 86)
• Late 1970s
  – Generalized linear models (nonlinear, non-normal distributions)
  – EM algorithm
    • Solves the estimation problem with incomplete data
    • Treat the data as incomplete for "computational purposes"
• Early 1980s (slide 46)
  – Resampling methods
    • Replace the n observations with estimates (pseudovalues p_i), where \hat{\theta}_{(i)} is the estimate computed with the i-th group of k observations removed:
      p_i = n\hat{\theta} - (n - k)\hat{\theta}_{(i)},   \hat{\theta}_p = (1/n) \sum_{all i} p_i
    • Jackknife estimate
      – Bias reduction tool
      – Estimate of the error in the estimate
• Late 1980s (slide 47)
  – From globally linear to locally linear models
  – Scatterplot smoothing
• Early 1990s
  – Projection pursuit and squashing
  – Focus shifts from model estimation to model selection

Statistical Perspective (slide 48)
• Interpretability
• Characterization of uncertainty
• Borrowing strength
  – Artificial stability
• Examine the model for
  – residuals
  – diagnostics
  – parameter covariances
• Regularization
  – Ockham's razor: the simpler model yields the best generalization
• Common statistical methods (slide 49)
  – Linear models
  – Nonparametric methods
  – Projection pursuit
  – Neural networks
  – Polynomial networks
  – Decision trees
    • Classification
    • Estimation
  – Splines
• Summary (slide 50)
  – Strengths of the statistical approach
    • Uncertainty is controlled
    • Explicit assumptions
    • Stability
    • Interpretability
    • Quantification of variance
  – Statistical methods are essential knowledge discovery tools
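The jackknife pseudovalue idea from the resampling slides can be sketched in code. This is a minimal illustration, not from the course notes: it assumes the common leave-one-out case (k = 1) and uses the sample mean as a stand-in estimator; the `jackknife` function name is my own.

```python
from statistics import mean, stdev
from math import sqrt

def jackknife(data, estimator):
    """Jackknife bias-reduced estimate via pseudovalues.

    Pseudovalue (leave-one-out, k = 1):
        p_i = n * theta_hat - (n - 1) * theta_hat_(i)
    where theta_hat_(i) is the estimate with observation i removed.
    The mean of the pseudovalues is the bias-reduced estimate; their
    spread gives an estimate of the error in the estimate.
    """
    n = len(data)
    theta_full = estimator(data)  # estimate from all n observations
    pseudovalues = []
    for i in range(n):
        loo = data[:i] + data[i + 1:]        # leave observation i out
        theta_i = estimator(loo)             # theta_hat_(i)
        pseudovalues.append(n * theta_full - (n - 1) * theta_i)
    estimate = mean(pseudovalues)            # bias-reduced estimate
    std_err = stdev(pseudovalues) / sqrt(n)  # error in the estimate
    return estimate, std_err

# For a linear statistic such as the mean, the pseudovalues reduce to
# the observations themselves, so the jackknife estimate equals the
# plain sample mean.
est, se = jackknife([2.0, 4.0, 6.0, 8.0], mean)
```

For a biased estimator (e.g., a plug-in variance), the pseudovalue average differs from the plain estimate, which is where the bias-reduction property of the jackknife shows up.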