Midterm Review
1-Intro
• Data Mining vs. Statistics
– Predictive vs. experimental; hypothesis-driven vs. data-driven
• Different types of data
• Data Mining pitfalls
– With lots of data you can find anything
• Data privacy and security
– Good and bad examples
2- EDA and Visualization
• Good visualization is good analysis
• Examples of vis
– 1-d, 2-d, multivariate
– Histograms, boxplots, scatterplots, density estimates, etc.
– Overplotting with many points
– Conditional plots (small multiples); see the sketch after this slide
– Good, bad examples
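As a rough illustration of the plot types above, here is a minimal sketch using matplotlib and numpy (both assumed available); the random data and the panel layout are illustrative assumptions, not examples from the course.

```python
# A minimal visualization sketch; the synthetic data is purely illustrative.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.5 * x + rng.normal(scale=0.8, size=5000)
group = rng.integers(0, 3, size=5000)          # a categorical conditioning variable

fig, axes = plt.subplots(2, 3, figsize=(10, 6))

# 1-d: histogram and boxplot
axes[0, 0].hist(x, bins=40)
axes[0, 0].set_title("Histogram")
axes[0, 1].boxplot(x)
axes[0, 1].set_title("Boxplot")

# 2-d: scatterplot; alpha blending mitigates overplotting with many points
axes[0, 2].scatter(x, y, s=5, alpha=0.15)
axes[0, 2].set_title("Scatter (alpha for overplotting)")

# Conditional plots (small multiples): one panel per group
for g, ax in zip(range(3), axes[1, :]):
    mask = group == g
    ax.scatter(x[mask], y[mask], s=5, alpha=0.3)
    ax.set_title(f"group = {g}")

fig.tight_layout()
plt.show()
```

Alpha blending (or binned/density plots) is one common way to handle overplotting when there are many points.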
3- Data mining concepts
• Preparing data for analysis
– How to deal with missing data?
– What are good transformations?
– How to deal with outliers?
• Data reduction
– Reducing n: sampling, subsetting
– Reducing p:
• Principal components: finding projections that preserve variance
– Scree plot shows how much variance is accounted for by each PC (see the sketch after this slide)
• MDS:
– Needs a distance matrix
– Minimizes a ‘stress’ function
– Mostly used for visualization and EDA
• In-vs-out of sample evaluation
– In-sample: must penalize for complexity
– Out-of-sample: use cross-validation to evaluate predictive performance
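A minimal sketch of the two "reducing p" ideas above, assuming scikit-learn, numpy, and matplotlib are installed; the simulated data, the Euclidean distance choice, and the 2-d MDS embedding are illustrative assumptions, not the course's exact setup.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # n = 200 observations, p = 10 variables
X_std = StandardScaler().fit_transform(X)      # PCA is scale-sensitive, so standardize first

# Principal components: projections that preserve variance
pca = PCA().fit(X_std)
plt.plot(range(1, 11), pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")  # this is the scree plot
plt.show()

# MDS: starts from a distance matrix and minimizes a stress function;
# the 2-d result is mostly used for visualization and EDA
D = pairwise_distances(X_std, metric="euclidean")
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.show()
```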
3- Data mining concepts
• Complexity/Performance tradeoff
• Evaluating Classification models
– Accuracy (how many did I get right): not the best choice, especially with imbalanced classes
– Precision/recall or Sensitivity/specificity tradeoff
– Varying the classification threshold traces out the ROC curve (see the sketch below)
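A hedged sketch of these evaluation ideas with scikit-learn (assumed installed); the logistic-regression classifier, the 90/10 class imbalance, and the synthetic data are illustrative choices only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_curve, roc_auc_score)

# Imbalanced problem: accuracy alone can look good for a nearly useless classifier
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Out-of-sample evaluation via cross-validation (previous slide)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Precision/recall at the default 0.5 threshold
pred = clf.predict(X_te)
print("accuracy :", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))

# Sweeping the decision threshold over the predicted probabilities
# trades sensitivity against specificity and traces out the ROC curve
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
```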
4-Regression
• Linear regression
– What is it, what are the assumptions, how do you check them
– Model selection
• Exhaustive or Greedy (forward/backward selection) search
• Extensions of Linear regression
– Non-linear in form (e.g., polynomial terms), but still linear in the parameters
– Generalized Linear Models
• Logistic regression
• Poisson regression
– Shrinkage
• Ridge regression
• Lasso regression
• Profile (trace) plots show how the parameter estimates change as the penalty varies (see the sketch after this slide)
– Principal component regression
– Nonparametric models
• Smoothing splines
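A minimal sketch of the shrinkage idea and the profile (trace) plot, assuming scikit-learn, numpy, and matplotlib are installed; the penalty grid and the simulated regression data are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)      # penalties are scale-sensitive

alphas = np.logspace(-2, 3, 50)            # penalty strengths, small -> large

for Model, name in [(Ridge, "Ridge"), (Lasso, "Lasso")]:
    # Trace of each coefficient as the penalty increases:
    # ridge shrinks coefficients smoothly, lasso drives some exactly to zero
    paths = np.array([Model(alpha=a, max_iter=5000).fit(X, y).coef_ for a in alphas])
    plt.figure()
    plt.plot(alphas, paths)
    plt.xscale("log")
    plt.xlabel("penalty (alpha)")
    plt.ylabel("coefficient estimate")
    plt.title(f"{name} profile plot")
plt.show()
```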
5-Classification
• Categorical or binary response – ‘supervised’ learning
• LDA: fit a parametric model to each class
• Classification (decision) trees
– Binary splits on any predictor X
– Best split found algorithmically by Gini or entropy to maximize purity
– Best size can be found via cross-validation; see the sketch after this slide
– Can be unstable
• K-Nearest Neighbors
– Tradeoff of large/small k
• Probabilistic models
– Bayes error rate: best possible error rate, achievable only if the true model is known
– Naïve Bayes
• Independence assumption on p(xi|c)
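A hedged sketch comparing the classifiers named on this slide using scikit-learn (assumed installed); the synthetic data, the 5-fold cross-validation, and the candidate tree depths are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# LDA: a parametric (Gaussian) model fit to each class
# Naive Bayes: assumes the features are independent given the class
# KNN: small k -> flexible but noisy; large k -> smooth but possibly biased
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Naive Bayes": GaussianNB(),
    "5-NN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())

# Decision tree: splits are chosen greedily by Gini impurity (or entropy);
# the tree size (here max_depth) is chosen by cross-validation
grid = GridSearchCV(DecisionTreeClassifier(criterion="gini", random_state=0),
                    {"max_depth": [2, 3, 4, 5, 8, None]}, cv=5)
grid.fit(X, y)
print("tree", grid.best_params_, grid.best_score_)
```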
6-Clustering
• No response variable – ‘unsupervised’ learning
• Needs distance measures
– Euclidean, cosine, Jaccard, edit distance; measures for ordinal and categorical data
• K-means
– Select initial solution
– Assign points to the nearest mean, then re-calculate the means (see the sketch after this slide)
• Hierarchical clustering
– Solutions for all k from 1 to n
– Dendrogram effective visualization
– Different between-cluster distance functions (linkages) will result in different clusterings
• Probabilistic
– Mixture models fit using EM algorithm
– Model based clustering
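A minimal clustering sketch, assuming scikit-learn, scipy, and matplotlib are available; the three-blob data, k = 3, and the Ward linkage are illustrative assumptions rather than the course's settings.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# K-means: start from initial centers, then alternate between assigning
# points to the nearest mean and re-computing the means until convergence
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("k-means labels:", km.labels_[:10])

# Hierarchical clustering: the linkage choice (single/complete/average/ward)
# changes the clustering; the dendrogram shows solutions for every k
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()

# Probabilistic: a Gaussian mixture model fit by the EM algorithm
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("mixture weights:", gmm.weights_)
```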