Dealing with data, crunching numbers

Aim of this lecture
• Get acquainted with the building blocks of data mining
• Get acquainted with the difficulties and issues in data mining
• Decent data mining = knowledge + work
• Top data mining = knowledge, work, sweat, and a good handful of luck

Simple or complex models
The Super Crunchers book considers two ways of using data:
• Human vs. computer -> often (relatively) simple models (so it is not the case that people are generally outperformed by models because the models are so complex)
• Using data to make (surprising or better) predictions -> often more complex models

Data mining (super crunching, sort of)
“The process of extracting patterns from data”, or: Knowledge Discovery in Data (KDD).
Basically, multiple regression (from the 1800s) is one way to mine data. Usually, the term data mining is used when the approach is more ‘data driven’. Examples: see the Super Crunchers book.

Sources of (artificial) intelligence
• Reasoning versus learning (or deduction vs. induction)
• Learning from data
– Patient data
– Customer records
– Stock prices
– Piano music
– Criminal mug shots
– Websites
– Robot perceptions
– Etc.

Data mining tasks
• Undirected, explorative, descriptive, ‘unsupervised’ data mining
– Matching & search
– Profile & rule extraction
– Clustering & segmentation; dimension reduction
• Directed, predictive, ‘supervised’ data mining
– Predictive modeling

Data mining - history
• Bayesian inference (1700s)
• Multiple regression (1800s)
• Instance-based learning (1900 [?])
• Neural networks, genetic algorithms (1950s)
• Decision trees (1960s)
• Support vector machines (1980s)
• Random forests (1990s)
(and many other methods, which sometimes combine the previous ones)

General course of events in data mining
• Getting the data … (representativeness!)
• Preprocessing – this is also where part of the magic is
• Analysis
• Validation / prediction

Preprocessing data (and why is this necessary anyway?)
• Creating data sets
– Reducing computer time
– Feature extraction
– Flattening
– Longitudinal data set-up
– … Setting up the data (think about creating a data file from a server log file)
• On inputs (choosing X-variables)
– Attribute selection
– Attribute construction
– … Creating variables with substance (e.g. factor analysis)
• On input values (massaging X-variables)
– Outlier removal / clipping
– Normalization
– Creating dummies
– Missing value imputation
– …
THIS IS A LOT OF WORK: based on knowledge of data mining … AND of your topic (and a bit of luck won’t hurt).

Software
(mainly analysis, not much preprocessing [yet]) (and lots of others; see www.kdnuggets.com)

Kaggle
<show Kaggle website here>

Some analytical procedures (that can be used instead of multiple regression)
1. Other regression flavors
2. Decision trees
3. Instance-based learning
4. Neural networks

Sideline - defining model fit (it need not be R²)
• For classifying: % correct, perhaps weighted because some errors are worse than others; area under the ROC curve; …
• For predicting: R², least squares, least absolute difference, etc.
For instance, what if errors of a certain magnitude are acceptable, and others are not?
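[Editor's note, not part of the slides] As a side note to the model-fit slide above, here is a minimal sketch of how some of these fit measures can be computed with scikit-learn and NumPy; the toy numbers and the tolerance of 1.0 are made up for illustration.

```python
# Illustrative fit measures for classification and prediction (toy data).
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             r2_score, mean_absolute_error)

# Classification: true labels, predicted labels, and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.6, 0.9, 0.7, 0.4])

print("% correct:        ", accuracy_score(y_true, y_pred))
print("area under ROC:   ", roc_auc_score(y_true, y_prob))

# Prediction (regression): R2 vs. least absolute difference
y = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.5, 5.5, 7.0, 11.0])
print("R2:               ", r2_score(y, y_hat))
print("mean abs. error:  ", mean_absolute_error(y, y_hat))

# 'Errors of a certain magnitude are acceptable': share of predictions
# within a tolerance of 1.0 (tolerance chosen arbitrarily here)
print("within tolerance: ", np.mean(np.abs(y - y_hat) <= 1.0))
```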
Alternatives to regression: Regression variants
All regression variants use a weighted average of the X-variables to predict Y.
• Optimize something else, e.g. “median regression” minimizes |y − yp| instead of (y − yp)²
• Robust regression (e.g. create weights w on the basis of fit, re-estimate with w; repeat until convergence)
• MARS (multivariate adaptive regression splines)
• And many others …
Note: this is something different from using logistic regression, or multi-level regression.

Alternatives to regression: Decision trees
(Perhaps an example on a blackboard helps.)

Decision trees
• Very popular: easy to understand, implement, use, and calculate
• Risk: overfitting. There is no obvious “best” tree.
• Used mainly for classification, but can be extended to do regression analysis (“regression trees”)

Decision trees: issues
• How to choose the splitting variable, and where to split?
• Why not just calculate the optimal tree? It is not obvious which one this is.
• If x1 and x2 are the top splitting variables, are these also the most important predictors?
• If x3 does not end up as a splitting variable, should we conclude that x3 is not important?
• Sensitivity to outliers
http://www.geocities.com/adotsaha/CTree/CtreeinExcel.html
AnswerTree / CHAID

Alternatives to regression: Instance-based learning
(Slides taken from http://www.autonlab.org/tutorials/ by Andrew Moore)

Alternatives to regression: Neural networks
Inspired by neuronal computation in the brain (McCulloch & Pitts 1943 (!)). Inputs (attributes) are coded as activation on the input-layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons. The algorithm learns to find optimal weights using the training instances and a general learning rule. (MANY different variants exist.)

Tricks of the trade (for many of which we have no clue why they work [!])

Some tricks of the trade (1): “Bagging”
Sample with replacement from your data to generate new data sets, and weigh the different models you estimate (a code sketch follows at the end of this section). Amazingly, even simple estimators can lead to very good predictions.

Some tricks of the trade (2): “Boosting”
• Apply the estimator
• Calculate weights per case: higher if the case is badly predicted
• Go back to the first step, until the estimator is only just as good as random
• Predict using the stored weights
Amazingly, even simple estimators can lead to very good predictions.

Combining methods (“meta-learning”)
Having lots of methods gives rise to new ones …

More weird stuff
You can show that, in principle, different kinds of models are equivalent. But this does not mean that it does not matter which one you choose. Weighted averages of (good and bad) prediction models often outperform the best individual prediction models (“ensembling”). There are several strange tricks around, and we do not really know why they work …

Validation and prediction
The risk of overfitting (slides borrowed from Andrew Moore): … because our predictions on new data will be terrible!

First way out: create a test set
• Randomly choose, say, 30% of your data to be the test set; the remainder is the training set
• Run regressions on the training set
• Estimate future performance on the test set
Verdict:

Second way out: k-fold cross-validation (see the sketch below)
• How to decide on the k-value?
• What is the eventual model?

A thriving community …

Art or science (or luck, or sweat)?

Example results: Predicting survival for head & neck cancer

Can this be generalized at all?
Hmm … other than some basic rules, this is not really understood yet.
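[Editor's note, not part of the slides] To make the “bagging” trick from the slides above concrete, here is a minimal sketch of bootstrap resampling with a deliberately simple estimator; the synthetic data, the use of scikit-learn's DecisionTreeRegressor, and the 25 bootstrap rounds are all assumptions for illustration.

```python
# Bagging sketch: resample with replacement, fit a simple model on each
# sample, and average the predictions of all models.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_models = 25
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    tree = DecisionTreeRegressor(max_depth=3)    # deliberately simple estimator
    tree.fit(X[idx], y[idx])
    models.append(tree)

X_new = np.array([[0.0], [1.5]])
bagged_pred = np.mean([m.predict(X_new) for m in models], axis=0)
print(bagged_pred)
```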
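[Editor's note, not part of the slides] The two “ways out” on the validation slides can also be sketched in a few lines; the code below is an illustrative example with scikit-learn, made-up data, the 30% test split mentioned on the slide, and k = 5 chosen arbitrarily.

```python
# Held-out test set and k-fold cross-validation on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# First way out: ~30% test set, train on the rest, evaluate on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test-set R2:", model.score(X_test, y_test))

# Second way out: k-fold cross-validation (here k = 5)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold R2 scores:", scores, "mean:", scores.mean())
```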
A winner’s tale:
“We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes. We replaced the missing values by the mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach for imputing missing values based on KNN. On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features. Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we built shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also built shallow decision trees using two features at a time and used their predictions as extra features, in the hope of capturing some non-additive interactions among features. We also generated additional features using a co-clustering-based technique.”
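[Editor's note, not part of the slides] As a rough illustration of a few of the preprocessing steps the winners describe (range normalization, top-k-plus-“other” coding of a categorical variable, and mean imputation with a missing-value indicator), here is a pandas sketch. It is my own reading of the quote, not the winners’ code; the column names, data, and k = 2 are made up.

```python
# Illustrative preprocessing steps on a tiny made-up data frame.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, 4.0, np.nan, 10.0, 2.0],
    "cat": ["a", "b", "a", "c", "d"],
})

# Normalize a numerical variable by its range
col_range = df["num"].max() - df["num"].min()
df["num_scaled"] = (df["num"] - df["num"].min()) / col_range

# Missing-value indicator, then mean imputation
df["num_missing"] = df["num"].isna().astype(int)
df["num"] = df["num"].fillna(df["num"].mean())

# One binary column per top-k value, plus one 'other' column
# (the winners used the ten most common values; k = 2 here to keep it small)
top = df["cat"].value_counts().index[:2]
for v in top:
    df[f"cat_is_{v}"] = (df["cat"] == v).astype(int)
df["cat_is_other"] = (~df["cat"].isin(top)).astype(int)

print(df)
```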