The Knowledge Discovery Process; Data Preparation & Preprocessing
Bamshad Mobasher, DePaul University

The Knowledge Discovery Process - The KDD Process

Steps in the KDD Process
- Learning the application domain: translate the business problem into a data mining problem
- Gathering and integrating data
- Cleaning and preprocessing data (may be the most resource-intensive part)
- Reducing and selecting data: find useful features, dimensionality reduction, etc.
- Choosing the functions of data mining: summarization, classification, regression, association, clustering, etc.
- Choosing the mining algorithm(s)
- Data mining: discover patterns of interest; construct models
- Evaluating patterns and models
- Interpretation: analysis of results; visualization, alteration, removing redundant patterns, querying
- Use of discovered knowledge

Data Mining: What Kind of Data?
- Structured databases (relational, object-relational, etc.)
  - can use SQL to perform parts of the process, e.g.,
    SELECT category, count(*) FROM Items WHERE type = 'video' GROUP BY category
- Flat files
  - most common data source; can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
- Transactional databases
  - a set of records, each with a transaction id, a time stamp, and a set of items
  - may have an associated "description" file for the items
  - typical source of data used in market basket analysis
- Other types of databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information, such as maps or satellite imaging data)
  - time series / temporal data (time-dependent information such as stock market data; usually very dynamic)
  - the World Wide Web: basically a large, heterogeneous, distributed database
    - need for new or additional tools and techniques: information retrieval, filtering and extraction; agents to assist in browsing and filtering; Web content, usage, and structure (linkage) mining tools
- Data warehouses
  - a data warehouse is a repository of data collected from multiple (often heterogeneous) data sources, so that analysis can be done under the same unified schema
  - data from the different sources are loaded, cleaned, transformed, and integrated; this allows for interactive analysis and decision-making
  - usually modeled by a multi-dimensional data structure (data cube) to facilitate decision-making
    - aggregate values can be pre-computed and stored along many dimensions
    - each dimension of the data cube contains a hierarchy of values for one attribute
  - data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP)
  - OLAP operations allow the navigation of data at different levels of abstraction, such as drill-down, roll-up, slice, dice, etc.

DM Tasks: Classification
- classifying observations/instances into different "given" classes (i.e., classification is "supervised")
  - example: classifying credit applicants as low, medium, or high risk
- normally uses a training set in which all objects are already associated with known class labels; the classification algorithm learns from the training set and builds a model, and the model is used to classify new objects
- suitable data mining tools: decision tree algorithms (CHAID, C5.0, and CART), memory-based reasoning, neural networks
- example (hypothetical video store): build a model of users based on their rental history (returning on time, payments, etc.)
  - observe the current customer's rental and payment history; decide whether a deposit should be charged to the current customer

DM Tasks: Prediction
- same as classification, except that instances are classified according to some predicted or estimated future value
- in prediction, historical data is used to build a (predictive) model that explains the current observed behavior; the model can then be applied to new instances to predict future behavior or to forecast the future value of some missing attribute
- examples: predicting the size of a balance transfer if the prospect accepts the offer; predicting the load on a Web server in a particular time period
- suitable data mining tools: association rule discovery, memory-based reasoning, decision trees, neural networks

DM Tasks: Affinity Grouping
- determine what items often go together (usually in transactional databases)
- often referred to as Market Basket Analysis
  - used in retail for planning arrangement on shelves and for identifying cross-selling opportunities
  - "should" be used to determine the best link structure for a Web site
- examples: people who buy milk and beer also tend to buy diapers; people who access pages A and B are likely to place an online order
- suitable data mining tools: association rule discovery, clustering, nearest-neighbor analysis (memory-based reasoning)

DM Tasks: Clustering
- like classification, clustering is the organization of data into classes; however, class labels are unknown, and it is up to the clustering algorithm to discover acceptable classes
- also called unsupervised classification, because the classification is not dictated by given class labels
- there are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity)
- examples: market segmentation of customers based on buying patterns and demographic attributes; grouping user transactions on a Web site based on their navigational patterns

Characterization & Discrimination
- data characterization is a summarization of the general features of objects in a target class
  - the data relevant to a target class are retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction
  - example: characterize the video store's customers who regularly rent more than 30 movies a year
- data discrimination is used to compare the general features of objects between two classes: comparing the relevant features of objects in a target class against those in a contrasting class
  - example: compare the general characteristics of the customers who rented more than 30 movies in the last year with those who rented fewer than 5
- the techniques used for data discrimination are very similar to those used for data characterization, except that discrimination results include comparative measures

Example: Moviegoer Database

SELECT moviegoers.name, moviegoers.sex, moviegoers.age, sources.source, movies.name
FROM movies, sources, moviegoers
WHERE sources.source_ID = moviegoers.source_ID
AND movies.movie_ID = moviegoers.movie_ID
ORDER BY moviegoers.name;

  name     sex  age  source      movie
  Amy      f    27   Oberlin     Independence Day
  Andrew   m    25   Oberlin     12 Monkeys
  Andy     m    34   Oberlin     The Birdcage
  Anne     f    30   Oberlin     Trainspotting
  Ansje    f    25   Oberlin     I Shot Andy Warhol
  Beth     f    30   Oberlin     Chain Reaction
  Bob      m    51   Pinewoods   Schindler's List
  Brian    m    23   Oberlin     Super Cop
  Candy    f    29   Oberlin     Eddie
  Cara     f    25   Oberlin     Phenomenon
  Cathy    f    39   Mt. Auburn  The Birdcage
  Charles  m    25   Oberlin     Kingpin
  Curt     m    30   MRJ         T2 Judgment Day
  David    m    40   MRJ         Independence Day
  Erica    f    23   Mt. Auburn  Trainspotting

Example: Moviegoer Database
- Classification
  - determine sex based on age, source, and movies seen
  - determine source based on sex, age, and movies seen
  - determine the most recent movie based on past movies, age, sex, and source
- Prediction or Estimation
  - prediction needs a continuous variable (e.g., "age"): predict age as a function of source, sex, and past movies
  - if we had a "rating" field for each moviegoer, we could predict the rating a new moviegoer gives to a movie based on age, sex, past movies, etc.
- Clustering
  - find groupings of movies that are often seen by the same people; find groupings of people who tend to see the same movies
  - clustering might reveal relationships that are not necessarily recorded in the data (e.g., we may find a cluster dominated by people with young children, or a cluster of movies corresponding to a particular genre)
- Affinity Grouping
  - market basket analysis (MBA): "which movies go together?"
  - need to create "transactions" for each moviegoer containing the movies seen by that moviegoer:

      name    TID  Transaction
      Amy     001  {Independence Day, Trainspotting}
      Andrew  002  {12 Monkeys, The Birdcage, Trainspotting, Phenomenon}
      Andy    003  {Super Cop, Independence Day, Kingpin}
      Anne    004  {Trainspotting, Schindler's List}
      ...     ...  ...

  - may result in association rules such as:
      {"Phenomenon", "The Birdcage"} ==> {"Trainspotting"}
      {"Trainspotting", "The Birdcage"} ==> {sex = "f"}
- Sequence Analysis
  - similar to MBA, but the order in which items appear in the pattern is important
  - e.g., people who rent "The Birdcage" during a visit tend to rent "Trainspotting" in the next visit

The Knowledge Discovery Process - The KDD Process

Data Preprocessing
Why do we need to prepare the data?
- in real-world applications, data can be inconsistent, incomplete, and/or noisy: data entry, data transmission, or data collection problems; discrepancies in naming conventions; duplicated records; incomplete data; contradictions in the data
- what happens when the data cannot be trusted? Can the decisions be trusted? Decision making is jeopardized
- there is a better chance of discovering useful knowledge when the data is clean

Data Preprocessing
- data cleaning
- data integration
- data transformation (e.g., -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48)
- data reduction

Data Cleaning
- real-world application data can be incomplete (no recorded values for some attributes; not considered at time of entry), noisy (random errors), and inconsistent (irrelevant records or fields)
- data cleaning attempts to: fill in missing values, smooth out noisy data, correct inconsistencies, and remove irrelevant data

Dealing with Missing Values
- data is not always available (missing attribute values in records): equipment malfunction; values deleted due to inconsistency or misunderstanding; values not considered important at time of data gathering
- handling missing data:
  - ignore the record with missing values
  - fill in the missing values manually
  - use a global constant to fill in missing values (NULL, unknown, etc.)
  - use the attribute mean to fill in missing values of that attribute
  - use the attribute mean of all samples belonging to the same class to fill in the missing values
  - infer the most probable value to fill in the missing value (may require methods such as Bayesian classification or decision trees to automatically infer missing attribute values)

Smoothing Noisy Data
- the purpose of data smoothing is to eliminate noise
- original data for "price" (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
- binning by means: each value in a bin is replaced by the mean value of the bin
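Two of the missing-value strategies listed above (fill with the attribute mean, and fill with the per-class attribute mean) can be sketched in plain Python. The small age/risk data set and the function names below are hypothetical, for illustration only:

```python
# Sketch: filling missing values (represented as None) with the attribute
# mean, and with the mean of samples belonging to the same class.

def mean_fill(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def class_mean_fill(values, labels):
    """Replace None entries with the mean of known values in the same class."""
    means = {}
    for lbl in set(labels):
        known = [v for v, l in zip(values, labels) if l == lbl and v is not None]
        means[lbl] = sum(known) / len(known)
    return [means[l] if v is None else v for v, l in zip(values, labels)]

ages = [25, None, 35, 45, None]
risk = ["low", "low", "high", "high", "high"]
print(mean_fill(ages))              # [25, 35.0, 35, 45, 35.0]
print(class_mean_fill(ages, risk))  # [25, 25.0, 35, 45, 40.0]
```

Note how the class-conditional fill gives different values to the two missing ages, because they belong to different classes.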
- partition into equi-depth bins:
    Bin1: 4, 8, 15    Bin2: 21, 21, 24    Bin3: 25, 28, 34
- smoothing by bin means:
    Bin1: 9, 9, 9     Bin2: 22, 22, 22    Bin3: 29, 29, 29
- smoothing by bin boundaries (the min and max values in each bin are identified as boundaries; each value is replaced with the closest boundary value):
    Bin1: 4, 4, 15    Bin2: 21, 21, 24    Bin3: 25, 25, 34

Smoothing Noisy Data - Other Methods
- clustering: similar values are organized into groups (clusters); values falling outside of clusters may be considered "outliers" and may be candidates for elimination
- regression: fit the data to a function; linear regression finds the best line to fit two variables, and multiple regression can handle multiple variables; the values given by the function are used instead of the original values

Smoothing Noisy Data - Example
To smooth "Temperature" by bin means with bins of size 3:
1. Sort the values of the attribute (keep track of the ID or key so that the transformed values can be replaced in the original table).
2. Divide the data into bins of size 3 (or fewer, in the case of the last bin).
3. Convert the values in each bin to the mean value for that bin.
4. Put the resulting values into the original table.

Original table:
  ID  Outlook   Temperature  Humidity  Windy
  1   sunny     85           85        FALSE
  2   sunny     80           90        TRUE
  3   overcast  83           78        FALSE
  4   rain      70           96        FALSE
  5   rain      68           80        FALSE
  6   rain      65           70        TRUE
  7   overcast  58           65        TRUE
  8   sunny     72           95        FALSE
  9   sunny     69           70        FALSE
  10  rain      71           80        FALSE
  11  sunny     75           70        TRUE
  12  overcast  73           90        TRUE
  13  overcast  81           75        FALSE
  14  rain      75           80        TRUE

Sorted temperatures, binned, then replaced by bin means:
  ID  Temperature  Bin         Smoothed
  7   58           Bin1        64
  6   65           Bin1        64
  5   68           Bin1        64
  9   69           Bin2        70
  4   70           Bin2        70
  10  71           Bin2        70
  8   72           Bin3        73
  12  73           Bin3        73
  11  75           Bin3        73
  14  75           Bin4        79
  2   80           Bin4        79
  13  81           Bin4        79
  3   83           Bin5        84
  1   85           Bin5        84

The value of every record in each bin is changed to the mean value for that bin.
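The binning example above can be reproduced in a short script. This is a minimal sketch in plain Python; the function names are illustrative, not from any particular library:

```python
# Sketch: equi-depth binning with smoothing by bin means and by bin
# boundaries, using the "price" values from the slides.

def equidepth_bins(values, depth):
    """Sort the values and partition them into bins of (at most) `depth` items."""
    s = sorted(values)
    return [s[i:i + depth] for i in range(0, len(s), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min/max boundaries."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, 3)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))           # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))      # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

The printed results match the Bin1-Bin3 values shown above.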
If it is necessary to keep the values as integers, the mean values are rounded to the nearest integer.

Smoothing Noisy Data - Example
The final table with the new values for the Temperature attribute:
  ID  Outlook   Temperature  Humidity  Windy
  1   sunny     84           85        FALSE
  2   sunny     79           90        TRUE
  3   overcast  84           78        FALSE
  4   rain      70           96        FALSE
  5   rain      64           80        FALSE
  6   rain      64           70        TRUE
  7   overcast  64           65        TRUE
  8   sunny     73           95        FALSE
  9   sunny     70           70        FALSE
  10  rain      70           80        FALSE
  11  sunny     73           70        TRUE
  12  overcast  73           90        TRUE
  13  overcast  79           75        FALSE
  14  rain      79           80        TRUE

Data Integration
- data analysis may require combining data from multiple sources into a coherent data store
- challenges in data integration:
  - schema integration: CID = C_number = Cust-id = cust#
  - semantic heterogeneity
  - data value conflicts (different representations or scales, etc.)
  - synchronization (especially important in Web usage mining)
  - redundant attributes (an attribute is redundant if it can be derived from other attributes); redundancies may be identified via correlation analysis: Pr(A,B) / (Pr(A) · Pr(B)) = 1 means independent, > 1 positive correlation, < 1 negative correlation
- meta-data is often necessary for successful data integration

Normalization
- min-max normalization: a linear transformation from v to v'
    v' = [(v - min) / (max - min)] × (newmax - newmin) + newmin
  - example: transform $30,000, in the range [10000..45000], into [0..1]:
    [(30000 - 10000) / 35000] × (1 - 0) + 0 = 0.571
- z-score normalization: normalization of v into v' based on the attribute's mean and standard deviation
    v' = (v - Mean) / StandardDeviation
- normalization by decimal scaling: move the decimal point of v by j positions, where j is the minimum number of positions such that the absolute maximum value falls in [0..1]
    v' = v / 10^j
  - example: if v ranges between -56 and 9976, then j = 4, and v' ranges between -0.0056 and 0.9976

Normalization: Example
- z-score normalization, v' = (v - Mean) / Stdev, applied to the "Humidity" attribute (Mean = 80.3, Stdev = 9.84):

    Humidity:    85    90    78     96    80     70     65     95    70     80     70     90    75     80
    Normalized:  0.48  0.99  -0.23  1.60  -0.03  -1.05  -1.55  1.49  -1.05  -0.03  -1.05  0.99  -0.54  -0.03

Normalization: Example II
- min-max normalization on an employee database
  - max distance for salary: 100000 - 19000 = 81000
  - max distance for age: 52 - 27 = 25
  - new min for age and salary = 0; new max for age and salary = 1

    ID  Gender  Age  Salary         ID  Gender  Age   Salary
    1   F       27   19,000         1   1       0.00  0.00
    2   M       51   64,000         2   0       0.96  0.56
    3   M       52   100,000        3   0       1.00  1.00
    4   F       33   55,000         4   1       0.24  0.44
    5   M       45   45,000         5   0       0.72  0.32

Data Reduction
- data is often too large; reducing the data can improve performance
- data reduction consists of reducing the representation of the data set while producing the same (or almost the same) analytical results
- data reduction includes: data cube aggregation, dimensionality reduction, discretization, and numerosity reduction (regression, histograms, clustering, sampling)

Data Cube Aggregation
- reduce the data to the concept level needed in the analysis; use the smallest (most detailed) level necessary to solve the problem
- queries regarding aggregated information should be answered using the data cube when possible

Dimensionality Reduction
- feature selection (i.e., attribute subset selection): select only the necessary attributes
- the goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
- there is an exponential number of possibilities, so use heuristics: select the local "best" (or most pertinent) attribute, e.g., using information gain, etc.
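The three normalization schemes described above can be sketched as plain functions. One assumption to flag: this sketch uses the sample standard deviation (n - 1 in the denominator), which is what produces the Stdev = 9.84 used in the Humidity example:

```python
import math

# Sketch: min-max normalization, z-score normalization, and normalization
# by decimal scaling, as described in the slides.

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """v' = [(v - min)/(max - min)] * (newmax - newmin) + newmin"""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(values):
    """v' = (v - mean) / stdev, with the sample standard deviation (n - 1)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / std for v in values]

def decimal_scale(values):
    """v' = v / 10^j, with j the smallest shift putting |max| in [0..1]."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j > 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(30000, 10000, 45000), 3))   # 0.571
print(decimal_scale([-56, 9976]))               # [-0.0056, 0.9976]
humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
print(round(z_score(humidity)[0], 2))           # 0.48
```

The outputs reproduce the $30,000 min-max example, the decimal-scaling example, and the first normalized Humidity value from the table above.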
- step-wise forward selection: {} → {A1} → {A1, A3} → {A1, A3, A5}
- step-wise backward elimination: {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} → {A1, A3, A5}
- combining forward selection and backward elimination
- decision-tree induction

Decision Tree Induction
(figure: attribute subset selection via an induced decision tree)

Numerosity Reduction
- reduction via histograms: divide the data into buckets and store a representation of the buckets (sum, count, etc.)
- reduction via clustering: partition the data into clusters based on "closeness" in space; retain representatives of the clusters (centroids) and the outliers
- reduction via sampling: will the patterns in the sample represent the patterns in the data? Random sampling can produce poor results; a stratified sample (stratum = group based on attribute value) is often better

Sampling Techniques
(figure: raw data vs. a cluster/stratified sample)

Discretization
- three types of attributes:
  - nominal: values from an unordered set (also "categorical" attributes)
  - ordinal: values from an ordered set
  - continuous: real numbers (but sometimes also integer values)
- discretization is used to reduce the number of values for a given continuous attribute: usually done by dividing the range of the attribute into intervals; interval labels are then used to replace actual data values
- some data mining algorithms only accept categorical attributes and cannot handle a range of continuous attribute values
- discretization can also be used to generate concept hierarchies: reduce the data by collecting and replacing low-level concepts (e.g., numeric values for "age") with higher-level concepts (e.g., "young", "middle-aged", "old")

Discretization - Example
Discretizing the "Humidity" attribute using 3 bins:
    Low = 60-69, Normal = 70-79, High = 80+

    Humidity:     85    90    78      96    80    70      65   95    70      80    70      90    75      80
    Discretized:  High  High  Normal  High  High  Normal  Low  High  Normal  High  Normal  High  Normal  High

Converting Categorical Attributes to Numerical Attributes
- using the weather table shown earlier, create separate columns for each value of a categorical attribute (e.g., 3 values for the Outlook attribute and 2 values for the Windy attribute); there is no change to the numerical attributes
- attributes: Outlook (overcast, rain, sunny), Temperature (real), Humidity (real), Windy (true, false)
- standard spreadsheet format:

    Outlook   Outlook  Outlook  Temp  Humidity  Windy  Windy
    overcast  rain     sunny                    TRUE   FALSE
    0         0        1        85    85        0      1
    0         0        1        80    90        1      0
    1         0        0        83    78        0      1
    0         1        0        70    96        0      1
    0         1        0        68    80        0      1
    0         1        0        65    70        1      0
    1         0        0        58    65        1      0
    ...
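Both transformations on this slide, interval labeling and indicator columns, can be sketched as follows. The cutoffs come from the Low/Normal/High example above; the helper names are illustrative, not from a standard API:

```python
# Sketch: (1) discretizing a continuous attribute into interval labels,
# (2) converting a categorical attribute into 0/1 indicator columns.

def humidity_label(v):
    """Low = 60-69, Normal = 70-79, High = 80+, per the example above."""
    if v < 70:
        return "Low"
    if v < 80:
        return "Normal"
    return "High"

def one_hot(column):
    """Map a list of categorical values to one indicator column per value."""
    values = sorted(set(column))
    return {v: [1 if x == v else 0 for x in column] for v in values}

humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
print([humidity_label(v) for v in humidity][:7])
# ['High', 'High', 'Normal', 'High', 'High', 'Normal', 'Low']

outlook = ["sunny", "sunny", "overcast", "rain", "rain"]
print(one_hot(outlook))
# {'overcast': [0, 0, 1, 0, 0], 'rain': [0, 0, 0, 1, 1], 'sunny': [1, 1, 0, 0, 0]}
```

The indicator output corresponds to the first rows of the spreadsheet-format table above.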
Visualizing Patterns - Example: Cross Tabulation
Using the weather table shown earlier, cross-tabulate Outlook against Windy:

                        Windy  Not Windy
    Outlook = sunny     2      3
    Outlook = rain      2      3
    Outlook = overcast  2      2

(The slide also shows this cross tabulation as a bar chart.)

Evaluating Models
- to train and evaluate models, the data are often divided into three sets: the training set, the test set, and the evaluation set
- the training set is used to build the initial model; it may be necessary to "enrich the data" to get enough of the special cases
- the test set is used to adjust the initial model: models can be tweaked to be less idiosyncratic to the training data and adapted into a more general model; the idea is to prevent "over-training" (i.e., finding patterns where none exist)
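A three-way split into training, test, and evaluation sets can be sketched as below. The 60/20/20 proportions and the fixed seed are illustrative assumptions, not prescribed by the text:

```python
import random

# Sketch: randomly dividing records into disjoint training, test, and
# evaluation sets. The 60/20/20 split is an arbitrary illustrative choice.

def three_way_split(records, train=0.6, test=0.2, seed=42):
    """Shuffle a copy of the records and cut it into three disjoint sets."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

train_set, test_set, eval_set = three_way_split(list(range(100)))
print(len(train_set), len(test_set), len(eval_set))  # 60 20 20
```

Every record lands in exactly one of the three sets, which is what lets the evaluation set serve as an honest estimate of performance on unseen data.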
- the evaluation set is used to evaluate the model's performance

Training Sets
- the training set will be used to train the models
- the most important consideration: the training set needs to cover the full range of values for all the features that the model might encounter
- it is a good idea to have several examples for each value of a categorical feature, and examples across the range of values for each numerical feature
- data enrichment: the training set should have a sufficient number of examples of rare events
  - a random sample of the available data is not sufficient, since common examples will swamp the rare ones
  - the training set needs to over-sample the rare cases so that the model can learn the features of rare events instead of predicting everything to be a common output

Test and Evaluation Sets
- reading too much into the training set (overfitting) is a common problem with most data mining algorithms: the resulting model works well on the training set but performs miserably on unseen data
- the test set can be used to "tweak" the initial model and to remove unnecessary inputs or features
- evaluation set for final performance evaluation: the test set cannot be used to evaluate model performance on future unseen data; the error rate on the evaluation set is a good predictor of the error rate on unseen data
- insufficient data to divide into three disjoint sets? In such cases, validation techniques can play a major role: cross validation, bootstrap validation

Cross Validation
- cross validation is a heuristic that works as follows: randomly divide the data into n folds, each with approximately the same number of records; create n models using the same algorithm and training parameters, where each model is trained on n-1 folds of the data and tested on the remaining fold
- can be used to find the best algorithm and its optimal training parameters
- steps in cross validation:
  1. Divide the available data into a training set and an evaluation set.
  2. Split the training data into n folds.
  3. Select an architecture (algorithm) and training parameters.
  4. Train and test n models.
  5. Repeat steps 2 to 4 using different architectures and parameters.
  6. Select the best model.
  7. Use all the training data to train the model.
  8. Assess the final model using the evaluation set.

Bootstrap Validation
- based on the statistical procedure of sampling with replacement
  - a data set of n instances is sampled n times, with replacement, to give another data set of n instances
  - since some elements will be repeated, there will be elements in the original data set that are not picked; these remaining instances are used as the test set
- how many instances end up in the test set?
  - probability of not being picked in one sampling: 1 - 1/n
  - Pr(not being picked in n samplings) = (1 - 1/n)^n ≈ e^(-1) = 0.368
  - so, for a large data set, the test set will contain about 36.8% of the instances
- to compensate for the smaller training sample (63.2%), the test-set error rate is combined with the re-substitution error on the training set:
    e = (0.632 × e_test) + (0.368 × e_training)
- bootstrap validation helps reduce the variance that can occur in each fold

Measuring Effectiveness
- predictive models are measured based on the accuracy of their predictions on unseen data
- classification and prediction tasks: accuracy is measured in terms of error rate (usually the percentage of records classified incorrectly); the error rate on a pre-classified evaluation set estimates the real error rate; the cross-validation methods discussed before can also be used
- estimation effectiveness: the difference between the predicted scores and the actual results (from the evaluation set)
  - the accuracy of the model is measured in terms of variance (i.e., the average of the squared differences); to express this in the original units, compute the standard deviation (i.e., the square root of the variance):
      e = sqrt( [ (p1 - a1)^2 + ... + (pn - an)^2 ] / n )

Ordinal or Nominal Outputs
- when the output field is ordinal or nominal (e.g., in two-class prediction), we use a classification table, the so-called confusion matrix, to evaluate the resulting model
- example:

                   Actual T  Actual F  Total
    Predicted T    18        3         21
    Predicted F    2         15        17
    Total          20        18        38

  - overall correct classification rate = (18 + 15) / 38 = 87%
  - given T, correct classification rate = 18 / 20 = 90%
  - given F, correct classification rate = 15 / 18 = 83%

Measuring Effectiveness
- Market Basket Analysis: MBA may be used for estimation or prediction (e.g., "people who buy milk and diapers tend to also buy beer")
  - confidence: the percentage of the time "beer" occurs in transactions that contain "milk" and "diapers" (i.e., a conditional probability)
  - support: the percentage of the time "milk", "diapers", and "beer" occur together in the whole training set (i.e., a prior probability)
- distance-based techniques
  - clustering and memory-based reasoning: a measure of distance is used to evaluate the "closeness" or "similarity" of a point to cluster centers or to a neighborhood
  - regression: the line of best fit minimizes the sum of the distances between the line and the observations
  - for numeric variables, the Euclidean distance measure is often used (the square root of the sum of the squares of the differences along each dimension)

Measuring Effectiveness: Lift Charts
- usually used for classification, but can be adapted to other methods
- measure the change in the conditional probability of a target class when going from the general population (the full test set) to a biased sample:
    lift = Pr(class_t | sample) / Pr(class_t | population)
- example: suppose the expected response rate to a direct mailing campaign is 5% in the training set
  - use a classifier to assign a "yes" or "no" value to the target class "predicted to respond"
  - the "yes" group will contain a higher proportion of actual responders than the test set; suppose the "yes" group (our biased sample) contains 50% actual responders
  - this gives a lift of 10 = 0.5 / 0.05
- what if the lift sample is too small? The sample size may need to be increased; there is a trade-off between lift and sample size
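The confusion-matrix rates and the lift computation above amount to a few lines of arithmetic. A minimal sketch, with illustrative function names:

```python
# Sketch: classification rates from the confusion-matrix example
# (18 correct T, 3 false T, 2 false F, 15 correct F), plus the lift example.

def rates(tp, fp, fn, tn):
    """Overall and per-class correct classification rates.
    tp = predicted T / actual T, fp = predicted T / actual F,
    fn = predicted F / actual T, tn = predicted F / actual F."""
    total = tp + fp + fn + tn
    return {
        "overall": (tp + tn) / total,
        "given_T": tp / (tp + fn),   # correct rate among actual T
        "given_F": tn / (tn + fp),   # correct rate among actual F
    }

def lift(p_sample, p_population):
    """lift = Pr(class | biased sample) / Pr(class | population)."""
    return p_sample / p_population

r = rates(tp=18, fp=3, fn=2, tn=15)
print(round(r["overall"], 2), round(r["given_T"], 2), round(r["given_F"], 2))
# 0.87 0.9 0.83
print(lift(0.5, 0.05))  # 10.0
```

The three rates reproduce the 87% / 90% / 83% figures from the confusion-matrix example, and the lift call reproduces the direct-mailing lift of 10.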