
Midterm Review

1- Intro
• Data Mining vs. Statistics
  – Predictive vs. experimental; hypothesis-driven vs. data-driven
• Different types of data
• Data Mining pitfalls
  – With lots of data you can find anything
• Data privacy and security
  – Good and bad examples

2- EDA and Visualization
• Good visualization is good analysis
• Examples of vis
  – 1-d, 2-d, multivariate
  – Histograms, boxplots, scatterplots, density estimates, etc.
  – Overplotting with many points
  – Conditional plots (small multiples)
  – Good, bad examples

3- Data mining concepts
• Preparing data for analysis
  – How to deal with missing data?
  – What are good transformations?
  – How to deal with outliers?
• Data reduction
  – Reducing n: sampling, subsetting
  – Reducing p:
    • Principal components: finding projections that preserve variance
      – Scree plot shows how much variance is accounted for by each PC
    • MDS:
      – Needs a distance matrix
      – Minimizes ‘stress function’
      – Mostly used for visualization and EDA
• In-vs-out of sample evaluation
  – In-sample: must penalize for complexity
  – Out-of-sample: use cross-validation to evaluate predictive performance
• Complexity/Performance tradeoff
• Evaluating Classification models
  – Accuracy (how many did I get right): not the best choice
  – Precision/recall or Sensitivity/specificity tradeoff
  – Selecting different thresholds traces out the ROC curve
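
The threshold tradeoff in classifier evaluation can be made concrete with a small sketch. The code below is illustrative only: the function names (`confusion_counts`, `metrics`, `roc_points`) and the toy scores and labels are assumptions made up for this example, not part of the course material. It counts the confusion-matrix cells at a given threshold, derives accuracy, precision, recall (sensitivity) and specificity, and collects one ROC point per threshold.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(scores, labels, threshold):
    """Accuracy, precision, recall (sensitivity) and specificity at one threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return {
        "accuracy": (tp + tn) / len(labels),
        # Convention chosen here: precision is 1.0 when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 1.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


def roc_points(scores, labels, thresholds):
    """One (false positive rate, true positive rate) pair per threshold."""
    points = []
    for t in thresholds:
        m = metrics(scores, labels, t)
        points.append((1.0 - m["specificity"], m["recall"]))
    return points


# Toy data (hypothetical): predicted scores and true 0/1 labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t in (0.25, 0.5, 0.75):
    print(t, metrics(scores, labels, t))
print(roc_points(scores, labels, [1.1, 0.5, 0.0]))
```

Lowering the threshold raises recall at the cost of specificity, which is why a single accuracy figure hides the tradeoff that the ROC curve displays.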
4- Regression
• Linear regression
  – What is it, what are the assumptions, how do you check them
  – Model selection
    • Exhaustive or Greedy (forward/backward selection) search
• Extensions of Linear regression
  – Non-linear in the predictors, linear in the parameters
  – Generalized Linear Models
    • Logistic regression
    • Poisson regression
  – Shrinkage
    • Ridge regression
    • Lasso regression
    • Profile plots show the trace of parameter estimates
  – Principal component regression
  – Nonparametric models
    • Smoothing splines

5- Classification
• Categorical or binary response – ‘supervised’ learning
• LDA: fit a parametric model to each class
• Classification (decision) trees
  – Binary splits on any predictor X
  – Best split found algorithmically by Gini or entropy to maximize purity
  – Best size can be found via cross-validation
  – Can be unstable
• K-Nearest Neighbors
  – Tradeoff of large/small k
• Probabilistic models
  – Bayes error rate: best possible error if model is correct
  – Naïve Bayes
    • Independence assumption on p(xi|c)

6- Clustering
• No response variable – ‘unsupervised’ learning
• Needs distance measures
  – Euclidean, cosine, Jaccard, edit, ordinal and categorical
• K-means
  – Select initial solution
  – Classify points, then recalculate means
• Hierarchical clustering
  – Solutions for all k from 1 to n
  – Dendrogram is an effective visualization
  – Different distance functions (links) will result in different clusterings
• Probabilistic
  – Mixture models fit using EM algorithm
  – Model-based clustering
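
The K-means steps listed under Clustering (select an initial solution, classify points, then recalculate means) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans`, the use of Euclidean distance via `math.dist`, the stopping rule (stop when assignments no longer change) and the toy points are all choices made for this example.

```python
import math


def kmeans(points, init_centers, max_iter=100):
    """Plain K-means: alternate assignment and mean update until stable.

    points       -- list of equal-length numeric tuples
    init_centers -- the initial solution (one tuple per cluster)
    """
    centers = [tuple(c) for c in init_centers]
    assignment = None
    for _ in range(max_iter):
        # Classification step: each point goes to its nearest center (Euclidean).
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the means would not move
        assignment = new_assignment
        # Update step: recalculate each center as the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centers, assignment


# Toy data (hypothetical): two visually separated groups in 2-d.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = kmeans(points, init_centers=[(1.0, 1.0), (9.0, 8.5)])
print(centers)
print(assignment)
```

The result depends on the initial solution, so in practice the procedure is usually run from several starting points and the best clustering is kept.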
