Download Applied data mining

Applied data mining: data exploration & regression
Byron C Wallace
Some content from https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb

Data exploration
•  Before any analysis, you should usually start by calculating simple summary statistics and visualizing your data
•  I’ll show some fun visualizations now
[Figures: flowingdata.com, 9/9/11; NYTimes, 7/26/11]

Exploratory Data Analysis (EDA)
•  Get a general sense of the data
   –  means, medians, quintiles, histograms, boxplots
•  Data-driven (model-free)
•  Interactive and visual
   –  Humans are the best pattern recognizers
   –  Use more than 2 dimensions! (x, y, z, space, color, time, …)
•  Especially useful in early stages of data mining
   –  detect outliers (e.g. assess data quality)
   –  test assumptions (e.g. normal distributions or skewed?)
   –  identify useful raw data & transforms (e.g. log(x))
Always look at your data

Single Variable Visualization
•  Histogram:
   –  shows center, variability, skewness, modality, outliers, or strange patterns
   –  bins matter
•  But be careful with axes and scales!

Smoothed Histograms: Density plots
•  KDEs: sum and normalize
To notebook!

Boxplots
•  By convention, bottom and top of the box are the 1st and 3rd quartiles
•  Shows a lot of information about a variable in one plot
   –  Median
   –  IQR
   –  Outliers
   –  Range
   –  Skewness
•  Negatives
   –  Overplotting
   –  Hard to tell distributional shape
   –  No standard implementation in software (many options for whiskers, outliers)

Time Series
•  If your data has a temporal component, be sure to exploit it
[Figure annotations: summer peaks; steady growth trend; New Year bumps. Source: Data Mining 2011 - Volinsky - Columbia University]

Spatial Data
•  If your data has a geographic component, be sure to exploit it
•  Data from cities/states/zip codes – easy to get lat/long
•  Can plot as scatterplot

OK, your turn.
1. Download the plotting-and-viz.ipynb notebook from Canvas (under in-class-exercises/plotting-and-viz.ipynb) or from GitHub.
2. Work through this (execute each line) – make sure you understand what’s going on! EXPERIMENT and change things, etc.!
   - If anything is unclear or you just have questions about anything, wave me down.
3. Also complete the interspersed exercise.

Supervised learning
[Diagram labels: labeled data L; learner; unlabeled data U; predictive model]

Classification v Regression
•  Classification
   –  Predict discrete labels (spam / not spam)
•  Regression
   –  Predict continuous outcomes (e.g., house price)

Regression: fitting lines

The aim
•  We would like to find a line that minimizes the residuals (errors)
   –  Usually in terms of the sum of squared errors
•  This is called the line of best fit and it minimizes the sum of squared residuals

How to pick a line, given points?
•  Select $b_0$ and $b_1$ to minimize the sum of squared errors
•  The residual for point $i$, in terms of $y_i$, $x_i$, $b_0$, and $b_1$: $e_i = y_i - b_0 - b_1 x_i$
•  Summing the squared residuals: $SSE = \sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2$, where $N$ is the sample size for the data
•  It is this expression that we minimize with respect to $b_0$ and $b_1$, starting with the derivative with respect to the regression constant, $b_0$

Regression
•  Grus says:
   –  $b_0 = \bar{y} - b_1 \bar{x}$
   –  $b_1 = \mathrm{corr}(x, y) \cdot \mathrm{stddev}(y) / \mathrm{stddev}(x)$
•  … but where did these come from?

Other estimation methods
•  So we saw (almost) how to solve this analytically, but that is not always so easy
•  Iterative methods, such as gradient descent, are widely used instead
   –  Simple, easy to implement, work well

Gradient descent
•  The basic idea: to find a (local) function minimum, move in the negative direction of the gradient
•  The meta-algorithm: start with some “initial guess” for θ, and repeatedly change θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ)
•  Do until convergence: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
   –  This update is simultaneously performed for all values of $j$
   –  J(θ) is the loss function, θ are the parameters, and α is the step size, called the learning rate
   –  Each update takes a step in the direction of steepest decrease of J
•  To implement this algorithm, we have to work out the partial derivative term on the right-hand side

In Python…

Assumptions
•  Relationship between y and x is linear
•  The variance of the errors does not vary with x
•  Residuals are normally distributed

Hypothesis Testing and Regression
•  We are often interested in establishing the relationship between variables
•  Formally we can do this within the regression framework
•  Does x correlate with y?
   –  We can specify a model: $y_i = b_0 + b_1 x_i$
   –  Then we ask: is $b_1 = 0$? (this becomes $H_0$)

For example
Coefficients:
    (Intercept)  housing$lotsize
      34136.192            6.599

                  Estimate   Std. Error  t value  Pr(>|t|)
housing$lotsize   6.599e+00  4.458e-01   14.8     <2e-16 ***

$R^2$

A word of caution on coefficients
•  Anscombe's quartet

Multiple regression
•  Usually we have more than one predictor variable (jargon alert! we call these features)
   –  $y_i = b_0 + \mathbf{b} \cdot \mathbf{x}_i$
•  Assumptions: linear relationship; multivariate normality; no or little collinearity

Example in R (bathrooms matter!)

Lab time! (Class exercise)

Categorical features: dummy coding
•  How do we put categorical features into (multiple) regression models?
•  Example: suppose we want to include sex as a predictor (“male”/“female”). How do we put this in the model?
•  We insert an ‘indicator’ variable that is 1 if the person represented by a row is “female” and 0 otherwise

Dealing with non-linear features
•  Suppose we are predicting the number of days a house will remain on the market (on board)
•  Consider also predicting an ‘overall health’ or ‘fitness’ score

Interaction features
•  We may believe that some features interact with others
   –  E.g., maybe we want a feature crossing a price bin indicator with a “recently renovated” or “>= 3br” indicator
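The summary statistics named on the EDA and Boxplots slides (median, quartiles, IQR, outliers) can be computed with Python's standard library. This is a minimal sketch on made-up numbers. The 1.5 × IQR fence used to flag outliers is an assumption here: it is one common whisker convention, and as the Boxplots slide notes, software packages differ on this.

```python
from statistics import mean, median, quantiles

# Made-up toy sample with one suspiciously large value.
data = [2, 3, 3, 4, 5, 5, 6, 7, 30]

# quantiles(..., n=4) returns the three cut points between quarters.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("mean:", mean(data))
print("median:", median(data))
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
print("outliers:", outliers)
```

The mean is pulled up by the large value while the median is not, which is one reason to look at both before modeling.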
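The closed-form estimates quoted from Grus on the Regression slide (slope = corr(x, y) · stddev(y) / stddev(x), intercept = mean(y) minus slope · mean(x)) can be sketched in plain Python. The function names and the toy data below are illustrative assumptions, not taken from the course notebook.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

def least_squares_fit(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors."""
    b1 = correlation(xs, ys) * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1

def sum_of_squared_errors(b0, b1, xs, ys):
    """SSE = sum over i of (y_i - b0 - b1 * x_i)^2."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Made-up toy data, roughly y = 1 + 2x with noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = least_squares_fit(xs, ys)
sse_at_fit = sum_of_squared_errors(b0, b1, xs, ys)
sse_nudged = sum_of_squared_errors(b0, b1 + 0.1, xs, ys)
print(b0, b1, sse_at_fit, sse_nudged)
```

Nudging the slope away from the fitted value increases the SSE, which is what "line of best fit" means in the least-squares sense.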
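The gradient descent update from the slides, θj := θj − α ∂J(θ)/∂θj, can be sketched for simple linear regression as follows. The assumptions here: the loss J is the mean squared error (SSE divided by N), the parameters are (b0, b1), and the learning rate, step limit, and toy data are arbitrary illustrative choices.

```python
def gradient_descent(xs, ys, alpha=0.01, n_steps=5000, tol=1e-12):
    """Minimize J(b0, b1) = mean((y - b0 - b1*x)^2) by gradient descent."""
    b0, b1 = 0.0, 0.0  # initial guess
    n = len(xs)
    for _ in range(n_steps):
        residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to b0 and b1.
        grad_b0 = -2.0 * sum(residuals) / n
        grad_b1 = -2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        if max(abs(grad_b0), abs(grad_b1)) < tol:
            break  # gradient is essentially zero: converged
        # Simultaneous update: both gradients were computed before either step.
        b0, b1 = b0 - alpha * grad_b0, b1 - alpha * grad_b1
    return b0, b1

# Made-up toy data, roughly y = 1 + 2x with noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = gradient_descent(xs, ys)
print(b0, b1)
```

On this data the result should agree closely with the closed-form least-squares line. A learning rate that is too large makes the updates diverge, and one that is too small makes convergence slow.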
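For the hypothesis test of H0: b1 = 0, the t value in the R output is the slope estimate divided by its standard error. The sketch below computes both for a simple regression, using the standard formula SE(b1) = sqrt((SSE / (N − 2)) / Sxx). The function name and toy data are assumptions for illustration, and the p-value is left out because it needs a t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean

def slope_t_statistic(xs, ys):
    """Return (b1, standard error of b1, t statistic) for y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx  # equivalent to corr(x, y) * stddev(y) / stddev(x)
    b0 = y_bar - b1 * x_bar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(sse / (n - 2) / s_xx)
    return b1, se_b1, b1 / se_b1

# Made-up toy data, roughly y = 1 + 2x with noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
b1, se_b1, t_value = slope_t_statistic(xs, ys)
print(b1, se_b1, t_value)

# Same ratio applied to the numbers in the slide's R output.
t_from_slide = 6.599e+00 / 4.458e-01
print(round(t_from_slide, 1))
```

A large absolute t value means the estimated slope is many standard errors away from zero, which is evidence against H0.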
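Dummy coding and interaction features, as described on the last slides, both reduce to building 0/1 columns. A minimal sketch follows; the helper names, the price threshold, and the example rows are hypothetical.

```python
def dummy_code(values, positive_level):
    """Indicator column: 1 where the value equals positive_level, else 0."""
    return [1 if v == positive_level else 0 for v in values]

def interaction(col_a, col_b):
    """Cross two indicator columns: 1 only where both are 1."""
    return [a * b for a, b in zip(col_a, col_b)]

# Categorical feature -> indicator ("female" = 1, otherwise 0).
sex = ["female", "male", "female", "male"]
is_female = dummy_code(sex, "female")

# Interaction: price-bin indicator crossed with "recently renovated".
prices = [250_000, 480_000, 510_000, 300_000]
PRICE_THRESHOLD = 400_000  # hypothetical bin boundary
high_price = [1 if p >= PRICE_THRESHOLD else 0 for p in prices]
recently_renovated = [1, 0, 1, 1]
high_price_and_renovated = interaction(high_price, recently_renovated)

print(is_female)
print(high_price)
print(high_price_and_renovated)
```

Each new column can then be added to the design matrix like any other numeric feature. For a categorical feature with k levels, k − 1 indicators are typically used so the columns are not collinear with the intercept.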
