Applied data mining: data exploration & regression
Byron C Wallace
Some content from https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb

Data exploration
• Before any analysis, you should usually start by calculating simple summary statistics and visualizing your data
• I’ll show some fun visualizations now
(Example visualizations: flowingdata.com, 9/9/11; NYTimes, 7/26/11)

Exploratory Data Analysis (EDA)
• Get a general sense of the data
  – means, medians, quintiles, histograms, boxplots
• Data-driven (model-free)
• Interactive and visual
  – Humans are the best pattern recognizers
  – Use more than 2 dimensions!
    • x, y, z, space, color, time …
• Especially useful in early stages of data mining
  – detect outliers (e.g. assess data quality)
  – test assumptions (e.g. normal distributions or skewed?)
  – identify useful raw data & transforms (e.g. log(x))
Always look at your data

Single Variable Visualization
Histogram:
  – shows center, variability, skewness, modality, outliers, or strange patterns
  – bins matter
But be careful with axes and scales!

Smoothed Histograms: Density plots
KDEs: sum the kernels centered on each data point, then normalize
To notebook!

Boxplots
• By convention, bottom and top of the box are the 1st and 3rd quartiles
• Shows a lot of information about a variable in one plot
  – Median
  – IQR
  – Outliers
  – Range
  – Skewness
• Negatives
  – Overplotting
  – Hard to tell distributional shape
  – No standard implementation in software (many options for whiskers, outliers)

Time Series
If your data has a temporal component, be sure to exploit it
(Figure annotations: summer peaks; steady growth trend; New Year bumps)
Data Mining 2011 - Volinsky - Columbia University

Spatial Data
• If your data has a geographic component, be sure to exploit it
• Data from cities/states/zip codes – easy to get lat/long
• Can plot as scatterplot

OK, your turn.
1. Download the plotting-and-viz.ipynb notebook from Canvas (under in-class-exercises/plotting-and-viz.ipynb) or from GitHub.
2. Work through this (execute each line) – make sure you understand what’s going on! EXPERIMENT and change things, etc.!
  – If anything is unclear or you just have questions about anything, wave me down.
3. Also complete the interspersed exercise.

Supervised learning
(Diagram labels: labeled data L; learner; unlabeled data U; predictive model)

Classification v Regression
• Classification – Predict discrete labels (spam / not spam)
• Regression – Predict continuous outcomes (e.g., house price)

Regression
Fitting lines

The aim
• We would like to find a line that minimizes the residuals (errors)
  – Usually in terms of the sum of squared errors
• This is called the line of best fit and it minimizes the sum of squared residuals

How to pick a line, given points?
Select b_0 and b_1 to minimize the sum of squared errors. The residual for point i is
$$ e_i = y_i - b_0 - b_1 x_i $$
so the sum of squared errors is
$$ \mathrm{SSE} = \sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2 $$
where N is the sample size for the data. It is this expression that we minimize with respect to b_0 and b_1, starting with the derivative with respect to the regression constant, b_0.

Regression
Grus says
$$ b_1 = \mathrm{corr}(x, y) \cdot \frac{\mathrm{stddev}(y)}{\mathrm{stddev}(x)}, \qquad b_0 = \bar{y} - b_1 \bar{x} $$
… but where did these come from?

Other estimation methods
• So we saw (almost) how to solve this analytically – but that is not always so easy
• Iterative methods, such as gradient descent, are widely used instead
  – Simple, easy to implement, work well

Gradient descent
The basic idea: to find a (local) function minimum, move in the negative direction of the gradient

The meta-algorithm
We want to choose the parameters θ so as to minimize the loss function J(θ). Start with some “initial guess” for θ, then repeatedly change θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ).
Do until convergence:
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$
• The update is performed simultaneously for all values of j
• α is the step size, called the learning rate
• Each update takes a step in the direction of steepest decrease of J
To implement this algorithm, we have to work out the partial derivative term on the right-hand side.

In Python…

Assumptions
• Relationship between y and x is linear
• The spread of the errors does not vary with x (constant variance)
• Residuals are normally distributed

Hypothesis Testing and Regression
• We are often interested in establishing the relationship between variables
• Formally we can do this within the regression framework

Does x correlate with y?
We can specify a model: $y_i = b_0 + b_1 x_i$
Then we ask: is $b_1 = 0$? (this becomes H0)

For example
Coefficients:
    (Intercept)  housing$lotsize
      34136.192            6.599

                     Estimate  Std. Error  t value  Pr(>|t|)
    housing$lotsize  6.599e+00   4.458e-01     14.8    <2e-16 ***

R²

A word of caution on coefficients
Anscombe's quartet

Multiple regression
• Usually we have more than one predictor variable (jargon alert! we call these features)
  – $y_i = b_0 + \mathbf{b} \cdot \mathbf{x}_i$, where $\mathbf{b}$ and $\mathbf{x}_i$ are vectors
• Assumptions: linear relationship; multivariate normality; no or little collinearity

Example in R
(bathrooms matter!)

Lab time! (Class exercise)

Categorical features: dummy coding
• How do we put categorical features into (multiple) regression models?
• Example: suppose we want to include sex as a predictor (“male”/“female”) – how do we put this in the model?
• We insert an ‘indicator’ variable that is 1 if the person represented by a row is “female” and 0 otherwise

Dealing with non-linear features
• Suppose we are predicting the number of days a house will remain on the market (on board)
• Consider also predicting an ‘overall health’ or ‘fitness’ score

Interaction features
• We may believe that some features interact with others
  – E.g., maybe we want a feature crossing a price bin indicator with a “recently renovated” or “>= 3br” indicator
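
The dummy coding and interaction ideas above can be sketched in Python with NumPy. This is a minimal sketch, not code from the course notebook: the helper names (dummy_code, build_design_matrix, fit_least_squares), the feature names, and the tiny noise-free housing dataset are all made up for illustration. The fit uses ordinary least squares via numpy.linalg.lstsq rather than gradient descent.

```python
import numpy as np


def dummy_code(values, positive_label):
    """Return a 0/1 indicator: 1.0 where the value equals positive_label."""
    return np.array([1.0 if v == positive_label else 0.0 for v in values])


def build_design_matrix(sqft, renovated, three_plus_br):
    """Columns: intercept, sqft, renovated, >=3br, renovated * >=3br."""
    sqft = np.asarray(sqft, dtype=float)
    renovated = np.asarray(renovated, dtype=float)
    three_plus_br = np.asarray(three_plus_br, dtype=float)
    # The interaction feature is 1 only when BOTH indicators are 1.
    interaction = renovated * three_plus_br
    return np.column_stack(
        [np.ones_like(sqft), sqft, renovated, three_plus_br, interaction]
    )


def fit_least_squares(X, y):
    """Coefficients that minimize the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Hypothetical data, invented for this example only.
sqft = [800, 1200, 1500, 900, 2000, 1700, 1100, 2400]
renovated_raw = ["no", "no", "yes", "yes", "no", "yes", "no", "yes"]
bedrooms = [2, 3, 2, 4, 3, 1, 2, 5]

renovated = dummy_code(renovated_raw, "yes")
three_plus_br = np.array([1.0 if b >= 3 else 0.0 for b in bedrooms])

X = build_design_matrix(sqft, renovated, three_plus_br)

# Made-up "true" coefficients; y is generated without noise so that
# the fit should recover them (up to floating-point error).
true_coef = np.array([50.0, 0.1, 20.0, 30.0, 15.0])
y = X @ true_coef

coef = fit_least_squares(X, y)
names = ["intercept", "sqft", "renovated", ">=3br", "renovated x >=3br"]
for name, value in zip(names, coef):
    print(f"{name}: {value:.3f}")
```

The coefficient on the interaction column is the extra effect of being both renovated and >= 3br, beyond the sum of the two separate indicator effects. With real, noisy data the recovered coefficients would only approximate any underlying values.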