
COMP 3503
Data Preparation and Meta Data
with Daniel L. Silver

The KDD Process

[Figure: the KDD process. Stages: Data Consolidation, Selection and Pre-processing, Data Mining, Interpretation and Evaluation. Artifacts: Data Sources, Consolidated Data (Warehouse), Prepared Data, Patterns & Models (p(x)=0.02), Knowledge.]

Selection and Pre-processing

Core Problems & Approaches

• Problems:
  o identification of relevant data
  o representation of data
  o search for a valid pattern or model
• Approaches:
  o top-down verification by expert (OLAP)
  o interactive visualization of data/models
  o bottom-up induction from data (Data Mining)

[Figure: probability of sale plotted against Age and Income.]

As much effort is expended preparing data as applying a data mining tool.

• Iterative approach: prepare data, develop model
• The Data Mining phase will benefit from any insight that leads to an improved set of attributes
• Representation can facilitate or frustrate the search for the most accurate model (hypothesis)
• Spreadsheet, OLAP and visualization tools are very helpful

Data and Variable Characteristics

• Three basic variable data types:
  o Nominal (categorical): qualitative values
    marital status = single, married, divorced, widowed
  o Ordinal (ranked): values have rank order
    grade = A, B, C
  o Interval: values have order plus a metric scale for comparisons and arithmetic operations
    temperature = 2, 10, 20 or 10.5, 15.2, 19.3
    date = 12Aug99, 13Feb02
• Variables can be either discrete or continuous
  o Only interval numeric values are continuous
• In addition, data can be of various formats:
  o Text, numeric, logical, binary, date, money, …
• Data mining software will vary in its ability to accept these types and formats

Data Selection and Sampling

• Select response (dependent) variable
  o determine prior probability of categories
  o deal with volume bias issues
• Select predictor (independent) attributes
• Generate a set of examples
  o choose sampling method (random, stratified)
  o consider sample complexity: How many examples do I need to develop a reliable model?
• Handle outliers (obvious exceptions)
  o Remove the row or replace with an imputed value

Data Reduction

• The curse of dimensionality: number of attributes / number of values
• Reduce number of attributes
  o remove redundant and correlating attributes
  o combine attributes (logically, arithmetically, statistically (Principal Components Analysis))
• Reduce attribute value ranges
  o group symbolic discrete values
  o quantize continuous numeric values

Preliminary Statistical Analysis

• The coefficient of correlation, r, measures the linear dependence of two variables X and Y, -1 ≤ r ≤ +1; r^2 shows the magnitude of r.

[Figure: three scatter plots of Y against X: positive r, negative r, and an unclear (?) relationship.]

• Select attributes which correlate strongly with the response variable and pertain to the problem
• Remove or combine attributes that correlate with each other, or try to de-correlate through transformation
• Factor Analysis - ANOVA can be used to compare the relative contribution of each attribute to outcomes
• Principal Component Analysis - generates variates - linear combinations of original attributes
• Tools such as Minitab, SAS, SPSS can be used

• Transform data
  o de-correlate and normalize values
  o map time-series data to a static representation
• Encode data
  o representation must be appropriate for the Data Mining tool which will be used
  o continue to reduce attribute dimensionality where possible without loss of information
• Use spreadsheet functions or transformation and encoding software within the DM tool

Transformation and Encoding

Discrete variable values

• If necessary, transform to discrete numeric values
• Example, encode the value 4 as follows:
  o Nominal: one-of-N code (0 1 0 0 0) - five inputs
  o Ordinal: thermometer code (1 1 1 1 0) - five inputs
  o Interval: real value (0.4)* - one input
• Consider the relationship between values
  o (single, married, divorced) vs. (youth, adult, senior)

Continuous numeric values

• De-correlate via normalization of values:
  o Min-max: x' = [(newmax - newmin) (x - min) / (max - min)] + newmin
  o Euclidean: x' = x / sqrt(sum of all x^2)
  o Percentage: x' = x / (sum of all x)
  o Variance based: x' = (x - (mean of all x)) / variance
• Scale values using a linear transform if the data is uniformly distributed, or use a non-linear transform (log, power) if the distribution is skewed

Other encodings for continuous numeric values

Example: 1.6 meters could be encoded as:
• Single real-valued number (0.16)*
  o OK! But what if the data is skewed?
• Bits of a binary number (010000)
  o BAD! The modeling system must now learn the binary encoding
• One-of-N quantized intervals (0 1 0 0 0)
  o NOT GREAT! Presents discontinuities
• Distributed (fuzzy) overlapping intervals (0.3 0.8 0.1 0.0 0.0)
  o BEST! Deals well with skewing and presents no discontinuities

Extracting Features from a Single Variable

• From dates:
  o Day, week, month, quarter, holiday, weekend day
• From time:
  o Hour, minute, morning, afternoon, evening
• From address:
  o Postal Code components mean something
• Telephone number:
  o NPA-NNX-9999

Time Series Data

• Of great interest to business, science, medicine
• Time series data has high dimensionality
  o T_1, T_2, …, T_n
• Approaches to summary/characterization
  o Current value = T_i
  o Moving average = MA_i = (T_i + T_(i-1) + T_(i-2)) / 3
  o Trends = T_i - MA_i or = MA_i - MA_(i-k)

Textual Data

• A difficult data type
  o freeform, open-ended, syntax -vs- semantics
• Can have very high dimensions
  o thousands of potential values
• Approaches to summary/characterization
  o define a fixed set of N word classes based on frequency analysis
  o map word combinations to one of the N classes
  o automate via specialty software

Data Warehousing and Preparation: Access to Recent Information

• www.datawarehousing.com
• DWI - Data Warehouse Institute: www.dw-institute.com
• Wikipedia: http://en.wikipedia.org/wiki/Data_warehouse
• DW Information Centre: http://www.dwinfocenter.org
• A DW Tutorial: http://www.planet-sourcecode.com/vb/scripts/ShowCode.asp?lngWId=5&txtCodeId=378
• Text Books:
  o Data Warehousing texts by W.H. Inmon, Claudia Imhoff, Ralph Kimball
  o D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
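
Worked sketch for the Preliminary Statistical Analysis slides: the coefficient of correlation r can be computed directly from its definition and used to rank candidate attributes against the response variable. The column names and values below (age, income, sale) are made-up illustrations, not data from the course, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical predictor attributes and response variable.
    attributes = {
        "age": [25, 35, 45, 55, 65],
        "income": [30, 42, 50, 61, 75],
    }
    sale = [0, 0, 1, 1, 1]
    for name, column in attributes.items():
        r = pearson_r(column, sale)
        print(f"{name}: r = {r:.3f}, r^2 = {r * r:.3f}")
```

Attributes with r close to +1 or -1 against the response are candidates to keep; pairs of predictors that correlate strongly with each other are candidates to remove or combine.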
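
Worked sketch for the discrete-value encodings on the Transformation and Encoding slide: a one-of-N code for a nominal variable and a thermometer code for an ordinal one. The function names and the left-to-right position convention are assumptions made for illustration; the slide does not specify a bit order.

```python
def one_of_n(value, categories):
    """One-of-N code: a single 1 at the position of `value`, 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def thermometer(value, ordered_levels):
    """Thermometer code: 1s up to and including the rank of `value`."""
    rank = ordered_levels.index(value)  # raises ValueError if not present
    return [1 if i <= rank else 0 for i in range(len(ordered_levels))]


if __name__ == "__main__":
    # Nominal: no order among the values, so each gets its own input.
    print(one_of_n("married", ["single", "married", "divorced", "widowed"]))
    # Ordinal: the rank order is preserved by the run of leading 1s.
    print(thermometer("adult", ["youth", "adult", "senior"]))
```

The thermometer code keeps "youth < adult < senior" visible to the model, while the one-of-N code deliberately treats marital status values as unrelated.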
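
Worked sketch for the normalization formulas on the continuous-values slide. The min-max, Euclidean and percentage functions follow the slide's formulas term by term. For the variance-based form, this sketch divides by the standard deviation (the common z-score), which is an assumption; the slide writes the divisor as the variance. All function names are hypothetical.

```python
import math


def min_max(xs, new_min=0.0, new_max=1.0):
    """x' = (new_max - new_min) * (x - min) / (max - min) + new_min"""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for constant data")
    return [(new_max - new_min) * (x - lo) / (hi - lo) + new_min for x in xs]


def euclidean(xs):
    """x' = x / sqrt(sum of all x^2)"""
    norm = math.sqrt(sum(x * x for x in xs))
    if norm == 0:
        raise ValueError("Euclidean scaling is undefined for all-zero data")
    return [x / norm for x in xs]


def percentage(xs):
    """x' = x / (sum of all x)"""
    total = sum(xs)
    if total == 0:
        raise ValueError("percentage scaling is undefined when the sum is 0")
    return [x / total for x in xs]


def standardize(xs):
    """x' = (x - mean) / standard deviation (assumed z-score variant)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    if std == 0:
        raise ValueError("standardization is undefined for constant data")
    return [(x - mean) / std for x in xs]


if __name__ == "__main__":
    temperature = [2, 10, 20]  # example values from the interval-type slide
    print(min_max(temperature))
    print(euclidean(temperature))
    print(percentage(temperature))
    print(standardize(temperature))
```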
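
Worked sketch for the Time Series Data slide: mapping a series T_1 … T_n to a few static summary features (current value, three-point moving average, and the two trend measures). The function names, the dictionary keys and the sample series are illustrative assumptions; indices are 0-based here.

```python
def moving_average(series, i, window=3):
    """MA_i: mean of the `window` values ending at index i."""
    if i < window - 1:
        raise ValueError("not enough history for the moving average")
    return sum(series[i - window + 1 : i + 1]) / window


def summarize(series, i, k=1, window=3):
    """Static features describing the series at index i."""
    ma_i = moving_average(series, i, window)
    return {
        "current": series[i],                      # T_i
        "moving_average": ma_i,                    # MA_i
        "trend_vs_average": series[i] - ma_i,      # T_i - MA_i
        "trend_of_average": ma_i - moving_average(series, i - k, window),  # MA_i - MA_(i-k)
    }


if __name__ == "__main__":
    weekly_sales = [10, 12, 14, 16, 18]  # hypothetical series
    print(summarize(weekly_sales, i=4, k=1))
```

Each example presented to the data mining tool then carries a handful of fixed attributes instead of the full n-dimensional series.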
