COMP 3503
Data Preparation and Meta Data
with Daniel L. Silver

The KDD Process
[Figure: the KDD process. Stages shown: Data Sources, Data Consolidation, Consolidated Data (Warehouse), Selection and Pre-processing, Prepared Data, Data Mining, Patterns & Models (e.g. p(x)=0.02), Interpretation and Evaluation, Knowledge]

Selection and Pre-processing

Core Problems & Approaches
Problems:
• identification of relevant data
• representation of data
• search for valid pattern or model
[Figure: probability of sale, with axes Age and Income]
Approaches:
• top-down verification by expert (OLAP)
• interactive visualization of data/models
• * bottom-up induction from data * (Data Mining)

• As much effort is expended preparing data as applying a data mining tool
• Iterative approach: prepare data, develop model
• The Data Mining phase will benefit from any insight that leads to an improved set of attributes
• Representation can facilitate or frustrate the search for the most accurate model (hypothesis)
• Spreadsheet, OLAP and visualization tools are very helpful

Data and Variable Characteristics
Three basic variable data types:
• Nominal (categorical): qualitative values
  marital status = single, married, divorced, widowed
• Ordinal (ranked): values have rank order
  grade = A, B, C
• Interval: values have order plus a metric scale for comparisons and arithmetic operations
  temperature = 2, 10, 20 or 10.5, 15.2, 19.3
  date = 12Aug99, 13Feb02
Variables can be either discrete or continuous
• Only interval numeric values are continuous
In addition, data can be of various formats:
• Text, numeric, logical, binary, date, money, …
Data mining software will vary in its ability to accept these types and formats

Data Selection and Sampling
Select response (dependent) variable
• determine prior probability of categories
• deal with volume bias issues
Select predictor (independent) attributes
Generate a set of examples
• choose sampling method (random, stratified)
• consider sample complexity: how many examples do I need to develop a reliable model?
Handle outliers (obvious exceptions)
• remove the row or replace with imputed value

Data Reduction
The curse of dimensionality: number of attributes / number of values
Reduce number of attributes
• remove redundant and correlating attributes
• combine attributes (logically, arithmetically, statistically (Principal Components Analysis))
Reduce attribute value ranges
• group symbolic discrete values
• quantize continuous numeric values

Preliminary Statistical Analysis
The coefficient of correlation, r, measures the linear dependence of two variables X and Y, −1 ≤ r ≤ +1; r² shows the magnitude of the dependence, regardless of sign.
[Figure: three scatter plots of Y against X, labeled pos r, neg r, and ?]
Select attributes which correlate strongly with the response variable and pertain to the problem
Remove or combine attributes that correlate with each other, or try to de-correlate them through transformation
• Factor Analysis - ANOVA can be used to compare the relative contribution of each attribute to outcomes
• Principal Component Analysis - generates variates, which are linear combinations of the original attributes
Tools such as Minitab, SAS, SPSS can be used

Transform data
• de-correlate and normalize values
• map time-series data to a static representation
Encode data
• representation must be appropriate for the Data Mining tool which will be used
• continue to reduce attribute dimensionality where possible without loss of information
Use spreadsheet functions or the transformation and encoding software within the DM tool

Transformation and Encoding
Discrete variable values
If necessary, transform to discrete numeric values
Example: encode the value 4 as follows:
• Nominal: one-of-N code (0 1 0 0 0) - five inputs
• Ordinal: thermometer code (1 1 1 1 0) - five inputs
• Interval: real value (0.4)* - one input
Consider the relationship between values
• (single, married, divorced) vs. (youth, adult, senior)

Continuous numeric values
De-correlate via normalization of values:
• Min-max: x' = [(newmax − newmin) (x − min) / (max − min)] + newmin
• Euclidean: x' = x / sqrt(sum of all x^2)
• Percentage: x' = x / (sum of all x)
• Variance based: x' = (x − (mean of all x)) / (standard deviation of all x)
Scale values using a linear transform if the data is uniformly distributed, or use a non-linear transform (log, power) if the distribution is skewed

Other encodings for continuous numeric values
Example: 1.6 meters could be encoded as:
• Single real-valued number (0.16)*
  o OK! But what if the data is skewed?
• Bits of a binary number (010000)
  o BAD! The modeling system must now learn the binary encoding
• One-of-N quantized intervals (0 1 0 0 0)
  o NOT GREAT! Presents discontinuities
• Distributed (fuzzy) overlapping intervals (0.3 0.8 0.1 0.0 0.0)
  o BEST! Deals well with skewing and presents no discontinuities

Extracting Features from a Single Variable
From dates:
• Day, week, month, quarter, holiday, weekend day
From time:
• Hour, minute, morning, afternoon, evening
From address:
• Postal Code components mean something
Telephone number:
• NPA-NNX-9999

Time Series Data
Of great interest to business, science, medicine
Time series data is high-dimensional
• T_1, T_2, … , T_n
Approaches to summary/characterization
• Current value = T_i
• Moving average = MA_i = (T_i + T_{i-1} + T_{i-2}) / 3
• Trends = T_i − MA_i or = MA_i − MA_{i-k}

Textual Data
A difficult data type
• freeform, open-ended, syntax -vs- semantics
Can have very high dimensions
• thousands of potential values
Approaches to summary/characterization
• define a fixed set of N word classes based on frequency analysis
• map word combinations to one of the N classes
• automate via specialty software

Data Warehousing and Preparation
Access to Recent Information
• www.datawarehousing.com
• DWI - Data Warehouse Institute: www.dw-institute.com
• Wikipedia: http://en.wikipedia.org/wiki/Data_warehouse
• DW Information Centre: http://www.dwinfocenter.org
• A DW Tutorial: http://www.planet-sourcecode.com/vb/scripts/ShowCode.asp?lngWId=5&txtCodeId=378
Text Books:
• Data Warehousing texts by W.H. Inmon, Claudia Imhoff, Ralph Kimball
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
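
Worked example: normalization and time-series summaries
The normalization formulas listed under Transformation and Encoding (continuous numeric values) and the moving-average and trend summaries listed under Time Series Data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical and do not come from the slides, the variance-based formula is taken to mean the usual z-score (division by the standard deviation), and the input is assumed to be a non-empty list of numbers that are not all equal.

```python
import math
from statistics import mean, pstdev


def min_max(xs, new_min=0.0, new_max=1.0):
    """Min-max: x' = (newmax - newmin) * (x - min) / (max - min) + newmin."""
    lo, hi = min(xs), max(xs)  # assumes hi != lo
    return [(new_max - new_min) * (x - lo) / (hi - lo) + new_min for x in xs]


def euclidean(xs):
    """Euclidean: x' = x / sqrt(sum of all x^2)."""
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]


def percentage(xs):
    """Percentage: x' = x / (sum of all x)."""
    total = sum(xs)
    return [x / total for x in xs]


def z_score(xs):
    """Variance based, read here as a z-score: x' = (x - mean) / std dev.

    Assumption: population standard deviation (pstdev) is used.
    """
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def moving_average(ts, window=3):
    """MA_i = average of the last `window` values ending at T_i."""
    return [
        sum(ts[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(ts))
    ]


def trend(ts, window=3):
    """Trend = T_i - MA_i, for every i that has a full window."""
    return [t - ma for t, ma in zip(ts[window - 1 :], moving_average(ts, window))]


if __name__ == "__main__":
    temperatures = [2.0, 10.0, 20.0]  # interval values from the slides
    print(min_max(temperatures))
    print(z_score(temperatures))
    print(moving_average([1, 2, 3, 4, 5]))
    print(trend([1, 2, 3, 4, 5]))
```

Min-max keeps the shape of the distribution and only rescales it, which suits roughly uniform data; for skewed data a log or power transform would normally be applied before any of these functions.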