Data Mining – Algorithms:
Linear Models
Chapter 4, Section 4.6
Numeric Attributes
• Numeric prediction and/or numeric attributes as predictors
• Linear regression is a well-established statistical technique
– Designed to predict a numeric value based on numeric attributes
– Determines the optimal set of coefficients for a linear equation:
• pred = w0 + w1*a1 + w2*a2 + … + wn*an
– “Optimal” means the sum of squared prediction errors is minimized (a minimal sketch follows below)
– For data mining, the coefficients are fit on training data so that the model can be tested on test data
– I hope that a CSC major could read a statistics book and then write the code to do this
– However, there is no need to do this, since the method is so widely available, unless you are seeking to create an improved version
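As a rough sketch of what that code could look like (a minimal least-squares fit via numpy; the data shapes and names are illustrative assumptions, not the course spreadsheet):

```python
import numpy as np

def fit_linear_regression(A, y):
    """Fit pred = w0 + w1*a1 + ... + wn*an by minimizing squared error.

    A: (m, n) matrix of numeric attribute values, one row per instance.
    y: (m,) vector of numeric target values.
    Returns the weight vector (w0, w1, ..., wn).
    """
    m = A.shape[0]
    X = np.hstack([np.ones((m, 1)), A])        # column of 1s carries w0
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
    return w

def predict(w, A):
    """Apply the fitted weights to new instances."""
    m = A.shape[0]
    X = np.hstack([np.ones((m, 1)), A])
    return X @ w
```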
Example
• <Show Basketball Spreadsheet – Baskball sheet>
• NOTE – input values, weights, prediction vs. actual
• <Show testReg sheet – test on separate instances>
• NOTE – how it did – prediction vs. actual – difference, correlation (a sketch of this check follows below)
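A hedged sketch of the kind of check the testReg sheet performs, reusing predict from the sketch above (variable names are assumptions):

```python
import numpy as np

def evaluate(w, A_test, y_test):
    """Compare predictions to actual values on separate test instances."""
    preds = predict(w, A_test)               # predict() from the earlier sketch
    diff = preds - y_test                    # per-instance difference
    mae = np.mean(np.abs(diff))              # mean absolute error
    corr = np.corrcoef(preds, y_test)[0, 1]  # correlation with the actuals
    return mae, corr
```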
Using Regression for Classification
• Perform regression for each class
• Set output to be predicted = 1 for training
instances that belong to a class
• Set output to be predicted = 0 for training
instances that do NOT belong to the class
• Do this for each class, and you will have a “membership function” equation for each class
• On test, plug the new instance into each equation; the class whose equation produces the highest value is the prediction to make (a sketch follows below)
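A minimal sketch of this one-regression-per-class scheme, reusing fit_linear_regression and predict from above (class labels and data shapes are assumptions):

```python
import numpy as np

def fit_one_per_class(A, labels, classes):
    """Fit one regression per class, target 1 for members and 0 otherwise."""
    weights = {}
    for c in classes:
        y = (labels == c).astype(float)  # 1 for instances of class c, else 0
        weights[c] = fit_linear_regression(A, y)
    return weights

def classify(weights, a):
    """Plug one instance into each membership equation; highest value wins."""
    a = a.reshape(1, -1)
    return max(weights, key=lambda c: predict(weights[c], a)[0])
```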
Example
• <Show discretized sheet>
• NOTE – prep of data – into low, medium, high (a discretization sketch follows below)
• NOTE – weights for 3 regressions: high, med, low
• <Show Test sheet>
• NOTE – calcs for Hi, Med, Low
• (Doesn’t do that well; I suspect that the data may not be from the same source (NBA), and that the discretization was a bit of a problem (very few low).)
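One hedged way to do that prep step: bin the numeric target into low/medium/high classes. The cut points below are illustrative placeholders, not the ones used on the sheet:

```python
import numpy as np

def discretize(y, low_cut, high_cut):
    """Map a numeric target to 'low' / 'medium' / 'high' class labels.

    low_cut and high_cut are placeholder thresholds; in practice they
    would come from the data (e.g., equal-frequency binning).
    """
    return np.where(y < low_cut, "low",
           np.where(y < high_cut, "medium", "high"))
```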
More sophisticated
• Do as many pairwise competitions as necessary
• Training – two classes against each other:
– temporarily toss training instances that are not one of the two
– Set output = 1 for the class to be predicted and –1 for the other
• Test – do all pairwise competitions; the winner of each gets a vote (sketched below)
– E.g., say:
– Medium beats High
– Medium beats Low
– High beats Low
– Medium wins 2-1-0
• A conservative approach would be to predict nothing if no class dominates the voting
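A minimal sketch of the pairwise scheme under the same assumptions as the earlier code (+1/–1 targets, one model per pair, majority vote):

```python
import numpy as np
from itertools import combinations

def fit_pairwise(A, labels, classes):
    """Train one regression per pair of classes on just those instances."""
    models = {}
    for c1, c2 in combinations(classes, 2):
        mask = (labels == c1) | (labels == c2)       # toss other instances
        y = np.where(labels[mask] == c1, 1.0, -1.0)  # +1 for c1, -1 for c2
        models[(c1, c2)] = fit_linear_regression(A[mask], y)
    return models

def vote(models, a):
    """Run all pairwise competitions; each winner gets a vote."""
    a = a.reshape(1, -1)
    tally = {}
    for (c1, c2), w in models.items():
        winner = c1 if predict(w, a)[0] > 0 else c2
        tally[winner] = tally.get(winner, 0) + 1
    return max(tally, key=tally.get)  # a conservative version would abstain on ties
```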
In Context
• Has been used for decades in various applications (e.g., social science research)
• Bias – only searches for linear equations – no squares, cubes, etc.
• To work well, the data must fit a linear model – e.g., for classification, classes must be “linearly separable” – divisible with a line (in 2D; a plane in 3D, a hyperplane in higher dimensions)
• To work well, attributes should not be highly correlated with each other (a quick check is sketched below)
• Depends on numeric attributes
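One hedged way to check that correlation caveat before fitting (A is the attribute matrix from the earlier sketches; the 0.9 threshold is an arbitrary assumption):

```python
import numpy as np

def highly_correlated_pairs(A, threshold=0.9):
    """List attribute pairs whose absolute correlation exceeds a threshold."""
    corr = np.corrcoef(A, rowvar=False)  # attribute-by-attribute correlations
    n = corr.shape[0]
    return [(i, j, corr[i, j])
            for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold]
```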
Let’s Look at WEKA
• Linear Regression with Basketball data
• NOTE – no percent-correct measures for numeric prediction; instead:
– Correlation
– Error measures
• Discretize Points per minute
– Try logistic regression – a categorical prediction
approach
End Section 4.6