Applied data mining
data exploration & regression
Byron C Wallace
Some content from https://github.com/justmarkham/DAT4/blob/master/notebooks/08_linear_regression.ipynb
Data exploration
•  Before any analysis, you should usually start by
calculating simple summary stats and visualizing
your data
•  I’ll show some fun visualizations now
flowingdata.com 9/9/11
NYTimes 7/26/11
Exploratory Data Analysis (EDA)
•  Get a general sense of the data
–  means, medians, quintiles, histograms, boxplots
•  Data-driven (model-free)
•  Interactive and visual
–  Humans are the best pattern recognizers
–  Use more than 2 dimensions! (x, y, z, space, color, time …)
•  Especially useful in early stages of data mining
–  detect outliers (e.g. assess data quality)
–  test assumptions (e.g. normal distributions or skewed?)
–  identify useful raw data & transforms (e.g. log(x))
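A minimal sketch of the "simple summary stats" step, using only Python's standard library. The price list is made up for illustration, including one suspicious entry, and is not from the course data.

```python
import statistics

# Hypothetical sample: house prices in $1000s, with one suspicious entry.
prices = [210, 225, 198, 240, 232, 215, 1900, 205, 228, 219]

mean = statistics.mean(prices)
median = statistics.median(prices)
quartiles = statistics.quantiles(prices, n=4)  # three cut points: Q1, Q2, Q3

print(f"mean   = {mean:.1f}")
print(f"median = {median:.1f}")
print(f"Q1, Q2, Q3 = {quartiles}")

# A mean far above the median hints at right skew or an outlier.
gap = mean - median
print(f"mean - median = {gap:.1f}")
```

Here the single large value pulls the mean well above the median, which is the kind of data-quality signal worth checking before fitting any model.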
Always look at your data
Single Variable Visualization
Histogram:
–  shows center, variability, skewness, modality,
outliers, or strange patterns
–  bins matter
But be careful with axes and scales!
Smoothed Histograms:
Density plots
KDEs
sum and normalize
To notebook!
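As a rough illustration of "bins matter" and of "sum and normalize", the sketch below counts values into equal-width bins and builds a kernel density estimate by summing one Gaussian bump per data point and dividing so the result integrates to 1. The sample data, the Gaussian kernel, and the bandwidth of 0.3 are assumptions chosen for illustration; the course notebook may do this differently.

```python
import math

def histogram(data, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value goes in the last bin rather than a new one.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def gaussian_kde(data, bandwidth):
    """Return a density function: the sum of Gaussian bumps, normalized."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / norm
    return density

# Hypothetical sample with two separate groups of values.
data = [1.0, 1.2, 1.1, 0.9, 1.3, 4.0, 4.2, 3.9, 4.1, 4.3]

coarse = histogram(data, 2)  # two bins: the data look evenly spread
fine = histogram(data, 6)    # six bins: the empty gap between groups appears
print(coarse)
print(fine)

kde = gaussian_kde(data, bandwidth=0.3)
for x in (1.0, 2.5, 4.0):
    print(f"density at {x}: {kde(x):.3f}")
```

The bandwidth plays the same role for the density plot that the bin count plays for the histogram: too wide hides structure, too narrow shows noise.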
Boxplots
•  By convention, the bottom and top of the box
are the 1st and 3rd quartiles
•  Shows a lot of information about a
variable in one plot
–  Median
–  IQR
–  Outliers
–  Range
–  Skewness
•  Negatives
–  Overplotting
–  Hard to tell distributional shape
–  no standard implementation in software
(many options for whiskers, outliers)
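The numbers behind a boxplot can be sketched as follows. Because there is no single standard, this sketch assumes one common convention: quartiles from the "inclusive" interpolation method, and whiskers that extend to the most extreme points within 1.5 × IQR of the box. Other software may use different rules, and the sample values are hypothetical.

```python
import statistics

def boxplot_stats(data):
    """Boxplot summary using the common 1.5 * IQR whisker rule."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers reach the most extreme points that are still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": sorted(x for x in data if x < low_fence or x > high_fence),
    }

# Hypothetical sample with one extreme value.
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 40]
summary = boxplot_stats(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```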
Time Series
If your data has a temporal component, be sure to exploit it
[Figure annotations: summer peaks; steady growth trend; New Year bumps]
Data Mining 2011 - Volinsky - Columbia University
Spatial Data
•  If your data has a
geographic component,
be sure to exploit it
•  Data from cities/states/zip
codes – easy to get lat/long
•  Can plot as scatterplot
OK, your turn.
1. Download the plotting-and-viz.ipynb notebook
from Canvas (under in-class-exercises/plotting-and-viz.ipynb) or from GitHub.
2. Work through this (execute each line) – make sure
you understand what’s going on! EXPERIMENT
and change things, etc.!
- If anything is unclear or you just have questions about
anything, wave me down.
3. Also complete the interspersed exercise.
Supervised learning
[Diagram labels: labeled data L; learner; predictive model; unlabeled data U]
Classification v Regression
•  Classification
–  Predict discrete labels (spam / not spam)
•  Regression
–  Predict continuous outcomes (e.g., house price)
Regression
Fitting lines
The aim
•  We would like to find a line that minimizes
the residuals (errors)
–  Usually in terms of the sum of squared errors
•  This is called the line of best fit and it
minimizes the sum of squared residuals
How to pick a line, given points?
Select b0 and b1 to minimize the sum of squared errors.
The error (residual) for point i is
$e_i = y_i - b_0 - b_1 x_i$
so the sum of squared errors is
$SSE = \sum_{i=1}^{N} (y_i - b_0 - b_1 x_i)^2$
where N is the sample size for the data. It is this expression that we minimize
with respect to b0 and b1, starting with the derivative with respect to the
regression constant, b0.
Regression
Grus says
$b_0 = \bar{y} - b_1 \bar{x}$
$b_1 = \mathrm{corr}(x, y) \cdot \mathrm{stddev}(y) / \mathrm{stddev}(x)$
… but where did these come from?
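The two closed-form estimates can be tried out directly. This is a minimal sketch using only the standard library; the helper names (`mean`, `stddev`, `corr`, `fit_line`, `sse`) and the five data points are illustrative assumptions, not code from the slides or from Grus.

```python
import math

def mean(values):
    return sum(values) / len(values)

def stddev(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stddev(x) * stddev(y))

def fit_line(x, y):
    """Closed-form least-squares estimates of intercept b0 and slope b1."""
    b1 = corr(x, y) * stddev(y) / stddev(x)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared errors for the line y = b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical data lying close to y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_line(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse(x, y, b0, b1):.4f}")
```

Nudging either coefficient away from the fitted value and recomputing `sse` gives a larger number, which is a quick sanity check that the formulas do minimize the sum of squared errors.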
Other estimation methods
•  So we saw (almost) how to solve this
analytically – but not always so easy
•  Iterative methods, such as gradient
descent, are widely used instead
–  Simple, easy to implement, work well
Gradient descent
The basic idea: to find a (local)
function minimum, move in the
negative direction of the
gradient
The meta-algorithm: we want to choose θ so as to minimize the loss function J(θ).
Start with some “initial guess” for θ and repeatedly change θ to make J(θ) smaller,
until hopefully we converge to a value of θ that minimizes J(θ).
Do until convergence:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
•  The update is performed simultaneously for all of the parameters θj
•  The step size α is called the learning rate
•  Each update takes a step in the direction of steepest decrease of J
•  To implement this algorithm, we have to work out the partial derivative
term on the right-hand side
In Python…
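One possible way to write the update rule for the one-feature regression case is sketched below. It assumes the loss J = SSE / (2N), a fixed learning rate of 0.05, and a fixed number of steps in place of a real convergence test; those choices and the data are illustrative assumptions, not the code shown in class.

```python
def gradient_descent(x, y, alpha=0.05, n_steps=5000):
    """Fit y ~ b0 + b1 * x by batch gradient descent on J = SSE / (2N)."""
    b0, b1 = 0.0, 0.0  # initial guess
    n = len(x)
    for _ in range(n_steps):
        # Prediction errors under the current parameters.
        errors = [b0 + b1 * xi - yi for xi, yi in zip(x, y)]
        # Partial derivatives of J with respect to b0 and b1.
        grad_b0 = sum(errors) / n
        grad_b1 = sum(e * xi for e, xi in zip(errors, x)) / n
        # Simultaneous update: both gradients use the old parameter values.
        b0, b1 = b0 - alpha * grad_b0, b1 - alpha * grad_b1
    return b0, b1

# Hypothetical data lying close to y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = gradient_descent(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

On this data the result should agree closely with the analytic least-squares solution. A learning rate that is too large makes the updates diverge, so alpha usually needs some tuning for each dataset.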
Assumptions
•  Relationship between y and x is linear
•  The variance of the errors does not change with x
•  Residuals are normally distributed
Hypothesis Testing and Regression
•  We are often interested in establishing the
relationship between variables
•  Formally we can do this within the
regression framework
Does x correlate with y?
We can specify a model:
yi = b0 + b1xi
Then we ask: is b1 = 0? (This becomes the null hypothesis, H0.)
For example
Coefficients:
(Intercept)  housing$lotsize
  34136.192            6.599

                 Estimate   Std. Error  t value  Pr(>|t|)
housing$lotsize  6.599e+00  4.458e-01   14.8     <2e-16 ***

R²
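The t value of 14.8 reported for lotsize is the estimated slope divided by its standard error. The sketch below checks that arithmetic with the two reported numbers, then shows one way to compute the same quantities from raw data. The function name and the five data points are hypothetical. Turning t into a p-value needs the t distribution with N − 2 degrees of freedom, which is left out here.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, standard error of b1, t value) for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1, se_b1, b1 / se_b1

# The reported t value is just estimate / standard error.
t_from_output = 6.599 / 0.4458
print(f"t from the reported estimate and std. error: {t_from_output:.1f}")

# Hypothetical data to exercise the function.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
b1, se_b1, t = slope_t_statistic(x, y)
print(f"b1 = {b1:.3f}, se = {se_b1:.4f}, t = {t:.1f}")
```

A large absolute t value means the estimated slope is many standard errors away from zero, which is evidence against H0: b1 = 0.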
A word of caution on coefficients
Anscombe's quartet
Multiple regression
•  Usually we have more than one predictor variable
(jargon alert! we call these features)
–  yi = b0 + b · xi, where b and xi are now vectors with one entry per feature
•  Assumptions: linear relationship; multivariate
normality; little or no collinearity
Example in R
(bathrooms matter!)
Lab time! (Class exercise)
Categorical features:
dummy coding
•  How do we put categorical features into (multiple)
regression models?
•  Example: suppose we want to include sex as a
predictor (“male”/“female”) – how do we put this in
the model?
•  We insert an ‘indicator’ variable that is 1 if the
person represented by a row is “female” and 0
otherwise
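The indicator variable described above can be sketched in a few lines. The column values and function names are hypothetical. The second helper extends the idea to a feature with more than two levels by creating one indicator per level except a chosen reference level, which is a common convention but is not spelled out on the slide.

```python
def dummy_code(values, positive_level):
    """Map a two-level categorical column to a 0/1 indicator column."""
    return [1 if v == positive_level else 0 for v in values]

def dummy_code_multi(values, reference):
    """Code a k-level feature as k - 1 indicator columns, dropping `reference`."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical rows: one categorical value per person.
sex = ["female", "male", "male", "female", "male"]
is_female = dummy_code(sex, positive_level="female")
print(is_female)

# Hypothetical three-level feature, with "north" as the reference level.
region = ["north", "south", "west", "south", "north"]
region_dummies = dummy_code_multi(region, reference="north")
print(region_dummies)
```

With this coding, the regression coefficient on `is_female` is the estimated difference in the outcome between "female" and "male" rows, holding the other features fixed.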
Dealing with non-linear features
•  Suppose we are predicting the number of days
a house will remain on the market (on board)
•  Consider also predicting an ‘overall health’ or
‘fitness’ score
Interaction features
•  We may believe that some features interact
with others
–  E.g., maybe we want a feature crossing price
bin indicator with a “recently renovated” or “>=
3br” indicator
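A feature cross like the one described above can be built by multiplying two indicator columns, so the new feature is 1 only when both conditions hold. In this sketch the prices, the 400,000 cutoff for the price bin, and the variable names are all illustrative assumptions.

```python
def interaction(a, b):
    """Element-wise product of two 0/1 indicator columns."""
    return [ai * bi for ai, bi in zip(a, b)]

# Hypothetical houses; the price cutoff is an illustrative assumption.
prices = [180_000, 420_000, 350_000, 510_000]
renovated = [1, 0, 1, 1]

# Price-bin indicator: 1 if the house falls in the "high price" bin.
high_price = [1 if p >= 400_000 else 0 for p in prices]

# Interaction feature: 1 only for high-priced AND recently renovated houses.
high_price_and_renovated = interaction(high_price, renovated)
print(high_price_and_renovated)
```

Adding this column to the model lets the effect of "recently renovated" differ between price bins, which separate main-effect features cannot capture.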